Showing posts with label elasticsearch. Show all posts
Showing posts with label elasticsearch. Show all posts

Monday, October 01, 2018

Trip Report (sort of): RELX Search Summit 2018


Last week, I was at our London office attending the RELX Search Summit. The RELX Group is the parent company that includes my employer (Elsevier) as well as LexisNexis, LexisNexis Risk Solutions and Reed Exhibitions, among others. The event was organized by our Search Guild, an unofficial special interest group of search professionals from all these companies. As you can imagine, search is something of a big deal at both LexisNexis and Elsevier, given several large, well known search platforms such as Lexis Advance, ScienceDirect, Scopus and Clinical Key.

There were quite a few interesting presentations at the Search Summit, some of which I thought was quite groundbreaking from an applied research point of view. I was going to write up a trip report for that when I realized that at least some of the talks probably represented competitive information that would not be appropriate for a public forum. Besides, it is very likely that it would of limited interest anyway. So I decided to write this post only about the two presentations that I did at the Summit. Neither of them are groundbreaking, but I think it might be interesting for most people in general. Hopefully you think so too.

The first presentation was a 3 hour tutorial session around Content Engineering. The theme I wanted to explore was how to identify keywords in text using various unsupervised and supervised techniques, and how this can improve search. As you know, my last job revolved around search driven by a medical ontology, where the ontology was painstakingly hand-curated by a team of doctors, nurses and pharmacists over a span of several years.

Having an ontology makes quite a few things much easier. (It also makes several things much harder, but we won't dwell on that here). However, the use case I was trying to replicate was where you have the content to search, but no ontology to help you with it. Could we, using a variety of rule-based, statistical and machine learning techniques, identify phrases that represent key ideas in the text, similar to concepts in our ontology? And how does this help with search?

The dataset I used was the NIPS (Neural Information Processing Systems) conference papers from 1987 to 2017. The hope was that I would learn something about the cool techniques and algorithms that show up at NIPS, just by having to look at the text to debug problems, although it didn't quite work out the way I had hoped. I demonstrate a variety of techniques such as LLR (statistical), RAKE (rule based) and MAUI (machine learning based), as well as using Stanford and SpaCy NER models, and duplicate keyword detection and removal using SimHash and Dedupe. In addition, I also demonstrate how to do dimensionality reduction using various techniques (PCA, Topic Modeling, NMF, word vectors) and how they can be used to enhance the search experience. In addition, another point I was trying to make was that there is plently of third party open source tools available to do this job without significant investment in coding.

All the techniques listed above are demonstrated using Jupyter notebooks. In addition, I built a little Flask based web application that showed these techniques in action against a Solr 7.3 index containing the NIPS papers. The web application demonstrates techniques both on the query parsing side where we rewrite queries in various ways to utilize the information available, as well as on the content side, where the additional information is used to suggest documents like the one being viewed, or make personalized reading recommendations based on the collection of documents read already.

The presentation slides, the notebooks and the web application can all be found in my sujitpal/content-engineering-tutorial project on Github. Several new ideas were suggested by participants during the tutorial, since many of them had been looking at similar ideas already, so it morphed into a nice interactive workshop style discussion. I hope to add them in as I find time.

My second presentation was around Learning to Rank (LTR) basics. I had recently become interested in LTR following my visit to the Haystack Search Relevancy conference earlier this year, coupled with a chance discovery that a content-based recommender system I was working to help improve had around 40,000 labeled query document pairs, which could be used to improve the quality of recommendations.

The dataset I chose for this presentation was The Movie DataBase (TMDB), a collection of 45,000 movies, 20 genres and 31,000 unique keywords. The idea was to see if I could teach a RankLib LambdaMART model the ordering given by the rating field on a 10 point continuous scale. In a sense, the approach is similar to this LTR article using scikit-learn by Alfredo Motta. Most LTR datasets just give you the features dataset in LETOR format to train your ML models, so you can't actually do the full end-to-end pipeline.

In any case, the presentation starts off with a little bit of historical context, talks about different kinds of LTR models (pointwise, pairwise and listwise), some common algorithms that people tend to use, some advice to keep in mind when considering building an LTR model, ideas for features, the LETOR data format, etc. Most of the meat of the presentation is the creation of an LTR model using the Solr 7.4 and Elasticsearch 6.3.1 plugins, as well for a hypothetical indexer with no LTR support (I used Solr, but did the feature generation outside the indexer). I was hoping to cover at least one of the case studies but ran into technical difficulties (my fault, I should have listened to the organizers when they said to put everything in the slides).

Essentially, the methodology is similar for either of the 3 case studies, the main differences are in syntax (for Solr vs Elasticsearch). First we need a set of queries with a ranked list of documents for each. I used the ratings to create categorical query-document labels on a 5 point scale as explained earlier.

Once the data is loaded, the first step is to define the features to Solr and Elasticsearch - features are specified as function queries. We then generate the feature values from Solr or Elasticsearch by running the queries against these function queries and writing the features into a file in LETOR format. The reason we use the index is mostly to generate the query-document similarity features. For a system without LTR support, this can be done less efficiently outside the index as well.

The LETOR format was originally used by the LTR model suite RankLib (provides 8 different LTR models), and has since been adopted by most other third party LTR models. I trained a RankLib LambdaMART model for all 3 cases. Model training has to happen using third party algorithms, of which RankLib is one. The output of RankLib is an XML file whose format varies depending on what kind of model it represents. For a linear model, it is just a set of coefficients for each of the features defined. For RankNet, a neural network, it is a weight matrix that transforms the incoming features into a set of rank probabilities. For LambdaMART, which is a forest of decision trees, it is a set of trees, each with splits defined for various levels.

Once the model is trained, it has to be uploaded to Solr or Elasticsearch. Solr needs the model to be specified in JSON format, so you need to write some code to convert the XML to JSON, while Elasticsearch will accept the XML definition of the trained model without any conversion. You can now use the rerank functionality in Solr or Elasticsearch to rerank the top slice of a base query. For indexers that don't have LTR support, you will have to generate the search results, extract and rerank the top slice using your trained model, and add it back into the search results yourself.

Notebooks and scripts describing each of the case studies as well the presentation slides can be found in my sujitpal/ltr-examples repository on Github. My main objective here was to understand how to "do" LTR using Solr and Elasticsearch, so I didn't spend much time trying to improve results. Perhaps if I can find a bigger labeled dataset to play with, I might revisit this project and try to evaluate each platform in more detail, would appreciate suggestions if you know of any such datasets. Note that standard LTR datasets such as MSLR-WEB10K just provide the features in LETOR format, so it is all about the part where you train and evaluate the LTR model, nothing to do with showing the results on the search index. What I am looking for is a dataset with labeled query-document pairs.

That's all I have for today. This week I am at RecSys 2018 at Vancouver, hoping to learn more about recommendation systems from the very best in the field, and meet others working in the recommendation space. Do ping me if you are here as well, would be nice to meet face to face.


Saturday, June 18, 2016

Comparison of Image Search Performance using different kinds of vectors


My previous post was #5 on Hacker News Friday last week. Thanks to the person who thought it was good enough to post out there, and also to the folks who contacted me personally to congratulate me. I also wanted to thank the commenters for their constructive suggestions. While I was planning to do (and write) a bit more around this subject anyway, part of this post is based on your comments on the Hacker News thread.

In that post, I described a setup for image search using intensity histograms represented as a vector of "words" and associated payloads. Each histogram was created by binning the RGB pixel intensities into 75 equally spaced buckets (25 per channel). The idea is to make images searchable using a text search engine (I used Elasticsearch for my example). From a text search viewpoint, each image is a single "sentence" of upto 75 "words". Instead of using Lucene's standard Payload scores for ranking hits, I use a custom Function Query to compute Cosine Similarity between the query image and the results.

In this post, I will describe some other vector spaces into which I tried mapping my images. The vector spaces I tried are as follows:

  • Binning on RGB space (baseline).
  • Binning on HSV (Hue, Saturation, Value) space.
  • Hand-crafted Image Features (Intensity Histogram, Edge Histogram, Entropy).
  • Transfer Learning using a Deep Learning model (TL6 and TL7).

For evaluating the different approaches, I reused the evaluation framework I built for RGB vectors. The evaluation is done using a test set of 50 random images from the corpus. For each image, I try to find its rank in the results for the query. Ideally, it should be ranked at #1, so my success criteria is how close the Mean Reciprocal Rank (MRR) metric is to 1. Further, I vary the number of features that I use in my query, where the features are arranged in order of their importance to the image. The best approach would be the one that reaches the highest MRR with the least number of features. The chart below shows the results.


Note that since the size of the feature space is different for different approaches, the X-axis represents about 50% of the feature space available to each approach. For example, for RGB and HSV, the maximum number of features is 75, so the range shown represents 1-40 features for RGB. On the other hand for Transfer Learning 6 and 7 (tl6 and tl7), the number of features are 4096, so the visible range for these two is about 1-2000.

Clearly, the winner here is HSV (green), followed by TL7 (magenta) and TL6 (cyan) respectively. However, both TL6 and TL7 show better performance with smaller number of features. The Hand crafted Image Features (IMF) approach (red) fares the worst, very likely because I don't know much about image features. The RGB (blue) plateaus at around 0.8 for 35 features, same as what I noted in my last post.

Below, I describe each approach in more detail.

The Hue, Saturation, Value space (HSV) approach


I selected this color space based on the suggestion in Adrian Rosebrock's Complete Guide to Building an Image Search Engine, that while RGB is easy to understand, it fails to mimic how humans perceive color. The image in HSV has the same shape as the one in RGB, so I can reuse most of our code to vectorize the image.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# Source: src/es_build_hsv.py
# -*- coding: utf-8 -*-
import collections
import cv2
import numpy as np
import operator
import os

IMAGE_DIR = "/path/to/images/of/butterflies/"
OUTPUT_FILE = "/path/to/butterflies_hsv.txt"

def get_vector_hsv(img):
    axes = ["h", "s", "v"]
    bins = np.linspace(0, 256, 25)
    means = 0.5 * (bins[1:] + bins[:-1]).astype("uint8")
    words = []
    for i in range(len(axes)):
        px_orig = img[:, :, i].flatten()
        labels = np.searchsorted(bins, px_orig)
        px_reduced = np.choose(labels.ravel(), means, 
                               mode="clip").astype("uint8").tolist()
        counter = collections.Counter(px_reduced)
        words.extend([(axes[i] + str(x[0]), x[1]) for x in counter.items()])
    words_sorted = sorted(words, key=operator.itemgetter(1), reverse=True)
    max_freq = words_sorted[0][1]
    words_sorted = [(x[0], 100.0 * x[1] / max_freq) for x in words_sorted]
    return words_sorted

fout = open(OUTPUT_FILE, 'wb')
h, s, v = [], [], []
for img_file in os.listdir(IMAGE_DIR):
    image = cv2.imread(os.path.join(IMAGE_DIR, img_file))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    image_vec = get_vector_hsv(image)
    words = [e[0] + "|" + ("%.5f" % (e[1])) for e in image_vec]
    fout.write("%s\t%s\n" % (img_file, " ".join(words)))
fout.close()

The output of this block of code is a flat file with filename and HSV image vector. Each line of the file consists of the image file name and the image HSV vector. The HSV vector is a sequence of elements separated by space. Each element consists of the index and the magnitude, and is represented in the index as a pseudo "word" and its payload. As a reminder, here is a sample of the HSV output. The general format of this file is identical for all approaches.

1
2
3
4
1132868439-1210.jpg v122|100.00000 h16|86.25048 s16|85.04542 ...
1132868439-12100.jpg v122|100.00000 s16|92.02476 h37|90.57395 ...
1132868439-12101.jpg v122|100.00000 s16|96.92287 h37|85.75985 ...
...

Hand Crafted Image Features (IMF)


I wasn't planning on doing this one, since I don't know much about Image Processing techniques. However, there were some comments on the Hacker News thread as well as a helpful comment on my post that came with a paper reference, that convinced me that I should try it out. I was already using OpenCV for the HSV stuff, so I figured it wouldn't be that hard. At the very least it would be a chance to learn something new.

So anyway, I looked at the suggested paper Dicoogle, a Pacs Featuring Profiled Content Based Image Retrieval and found some features that the author successfully used to classify radiology images. These were:

  • Intensity Histogram
  • Edge Histogram
  • Entropy
  • Segmentations and respective area and center of mass
  • Image momentums

Of these, I was able to compute everything except for the segmentations, although I am not sure I am using them correctly. If you think I should do this differently please let me know.

Intensity Histograms are not new, the RGB and HSV approaches are based solely on intensity histograms. However, in this case, I decided to compute black and white histograms only, because I needed to compute B/W images for the other metrics.

The Edge Histogram is interesting. Essentially, you compute gradients along the horizontal and vertical directions using the Sobel filter, then compute the angle of any edge along each pixel, which can be a value between +/- Pi radians, then bin the angles into a histogram. This mailing list discussion has more details. The diagram below shows the original image, the gradients in the two directions and the resulting edge histogram for a single image.


The entropy is based off Shannon's formula using the guidelines on this blog post. For momentums, I used OpenCV's built in function to compute momentums off the contours, and agreegate them for all contours in the image. The code for this is shown below.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
# Source: src/es_build_imf.py
# -*- coding: utf-8 -*-
import cv2
import math
import numpy as np
import operator
import os

IMAGE_DIR = "/path/to/images/of/butterflies/"
OUTPUT_FILE = "/path/to/butterflies_imf.txt"

def get_intensity_histogram(img):
    """ 25 histogram entries per image (gray scale) """
    int_hist = np.histogram(img.flatten(), bins=25, range=(0, 255))[0]
    row_sum = np.sum(int_hist)
    int_hist = int_hist * 1.0 / row_sum
    return int_hist

def get_entropy(img):
    """ Adapted from:
        http://areshopencv.blogspot.com/2011/12/computing-entropy-of-image.html
        and
        https://en.wikipedia.org/wiki/Entropy
    """
    freqs = cv2.calcHist([img], [0], None, [255], [0, 255]).T[0]
    row_sum = np.sum(freqs)
    freqs = freqs * 1.0 / row_sum
    return sum([-x * math.log(x, 2) for x in freqs.tolist() if x > 0])
    
def get_moments(img):
    """ Compute average moments across all contours in the image. """
    ret, thresh = cv2.threshold(img, 127, 255, 0)
    img2, contours, hierarchy = cv2.findContours(thresh, 1, 2)
    moments = {}
    for contour in contours:
        mom = cv2.moments(contour)
        for k in mom.keys():
            if moments.has_key(k):
                moments[k] += mom[k]
            else:
                moments[k] = mom[k]
    val_sum = sum([moments[k] for k in moments.keys()])
    for k in moments.keys():
        moments[k] = moments[k] * 1.0 / val_sum
    return moments

def get_edge_histogram(img):
    """ Adapted from this discussion:
        http://opencv-users.1802565.n2.nabble.com/Edge-orientation-td3146691.html
    """
    dx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=5)
    dy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=5)
    grad = np.zeros((dx.shape[0], dx.shape[1]))
    for i in range(dx.shape[0]):
        for j in range(dx.shape[1]):
            grad[i, j] = math.atan2(dx[i, j], dy[i, j])
    edge_hist = np.histogram(grad, bins=25, range=(-math.pi, math.pi))[0]
    edge_sum = np.sum(edge_hist)
    edge_hist = edge_hist * 1.0 / edge_sum
    return edge_hist

def get_image_vector_imf(img):
    img_vec = []
    int_hist = get_intensity_histogram(img)
    img_vec.extend([("x" + str(i), x) 
                    for i, x in enumerate(int_hist.tolist())
                    if x > 0.0])
    edge_hists = get_edge_histogram(img)
    img_vec.extend([("e" + str(i), x) 
                    for i, x in enumerate(edge_hists.tolist())
                    if x != 0.0])
    moments = get_moments(img)
    for k in moments.keys():
        if moments[k] != 0.0:
            img_vec.append((k, moments[k]))
    entropy = get_entropy(img)
    img_vec.append(("ent", entropy))
    words = ["%s|%.5f" % (k, v) for k, v in 
        sorted(img_vec, key=operator.itemgetter(1), reverse=True)]
    return words
    
fout = open(OUTPUT_FILE, 'wb')
for img_file in os.listdir(IMAGE_DIR):
    print img_file
    img = cv2.imread(os.path.join(IMAGE_DIR, img_file), 0)
    words = get_image_vector_imf(img)
    fout.write("%s\t%s\n" % (img_file, " ".join(words)))
fout.close()

Transfer Learning using a Deep Learning Model (TL6 and TL7)


I used the reference Convolutional Neural Network (CNN) model built with Caffe and pre-trained against the ImageNet dataset of approximately 14 million images, to generate image vectors for my corpus of 200 butterfly images. Since the CNN has been trained to recognize many different kinds of features, it has developed the ability to extract good features from images - so essentially I use it as a feature extractor for my corpus of 200 butterfly images, even though none of them probably exist in ImageNet and the CNN has never seen them before. This article Recycling Deep Learning Models with Transfer Learning by Zachary Lipton has a good explanation of what Transfer Learning is and why it works so well.

Conceptually, the Caffe reference CNN looks like this. Each line represents a layer in the CNN. The second column represents the dimensions of the layer. The first element of the dimension tuple is the batch size, and the rest represent the actual dimension of the layer (for each input record). Input comes in through the "data" layer at the top and the class prediction probabilities are generated by the "prob" layer at the bottom. CNNs consist of alternating layers of convolutions and pooling, ending finally in one or more fully connected (fc) layers. In this case, there are 3 fully connected layers ("fc6", "fc7" and "fc8"). Of these, "fc8" is not very interesting because it has 1000 elements same as the "prob" layer, and therefore too closely connected to the ImageNet classes to be useful to us.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
data (50, 3, 227, 227)
conv1 (50, 96, 55, 55)
pool1 (50, 96, 27, 27)
norm1 (50, 96, 27, 27)
conv2 (50, 256, 27, 27)
pool2 (50, 256, 13, 13)
norm2 (50, 256, 13, 13)
conv3 (50, 384, 13, 13)
conv4 (50, 384, 13, 13)
conv5 (50, 256, 13, 13)
pool5 (50, 256, 6, 6)
fc6 (50, 4096)
fc7 (50, 4096)
fc8 (50, 1000)
prob (50, 1000)

The way the transfer learning setup works is that I take the trained model weights, and chop off the "fc8" and "prob" layers from the network, and run my butterfly images through the truncated network. This will result in a vector (output of "fc7") of size (4096,). We use this vector to represent our image for TL7.

For TL6, I chop off the "fc7" layer as well, and run my images through the resulting truncated network. Once again, the result is a vector of size (4096,) from "fc6" for each image.

The code to extract image vectors has been adapted from this Caffe classification example notebook, and is similar for TL6 and TL7. The only thing that changes is the protobuf file name (MODEL_DEF) for the network deployment descriptor and the layer name to get the vectors out of at the end (line 54). You will need to install Caffe and pycaffe for this work. Also, to import pycaffe as I have done in the code below, I had to add the location of pycaffe to my PYTHONPATH (ie, export PYTHONPATH=$PYTHONPATH:$CAFFE_HOME/python). Installation was mostly painless with Ubuntu 15.10 and Anaconda Python, the two things that I encountered were solved using solutions from here and here. Also since I am just running the model (as opposed to training it), a GPU box is not needed.

The original model definition for the MODEL_DEF assignment in the code is in this deploy.prototxt file. For running with TL7, I needed to remove the lines 184-213 (ie, everything below the layer named "fc7"), and for TL6, the lines 160-213 (ie, everything below the layer named "fc6").

My inputs were of size (640, 480, 3), but the CNN takes images of size (3, 227, 227) so I had to resize my images and swap the axes to make it compatible with the CNN. I also had to center the images along the mean. Caffe provides a "mean image" which looks a bit like this if you visualize it (unlike my intuition that its just a matrix of 128s).


The whole process of resizing, swapping axes, mean-centering and feeding the resulting matrix to the CNN is done using a Caffe Transformer. Once we define the transformer, we just loop through our images and run the CNN over each image, finally collecting the image vector at the other end.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
# Source: src/es_build_tl.py
# -*- coding: utf-8 -*-
import caffe
import matplotlib.pyplot as plt
import numpy as np
import operator
import os
import scipy.misc

IMAGE_DIR = "/path/to/butterflies/"
RESIZED_IMAGE_DIR = "/path/to/butterflies_227x227"
OUTPUT_FILE = "/path/to/butterflies_tl7.txt"

CAFFE_HOME = "../"    # in caffe/examples
MODEL_DEF = CAFFE_HOME + "models/bvlc_reference_caffenet/deploy_fc7.prototxt"
MODEL_WEIGHTS = CAFFE_HOME + "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel"
MEAN_IMAGE = CAFFE_HOME + "python/caffe/imagenet/ilsvrc_2012_mean.npy"

# resize input files to 227x227, since Caffe network expects these dimensions
for img_name in os.listdir(IMAGE_DIR):
    img_src = plt.imread(os.path.join(IMAGE_DIR, img_name))
    img_res = scipy.misc.imresize(img_src, (227, 227))
    plt.imsave(os.path.join(RESIZED_IMAGE_DIR, img_name), img_res)

# Caffe model setup - I modify the supplied deploy.prototxt file to remove
# the last two layers "fc8" and "prob". Output of "fc7" is a (4096,) vector
caffe.set_mode_cpu()
net = caffe.Net(MODEL_DEF, MODEL_WEIGHTS, caffe.TEST)

# get network configuration configuration
for layer_name, blob in net.blobs.iteritems():
    print layer_name, str(blob.data.shape)
    
# input preprocessing (subtract mean, align axes, etc)
mu = np.load(MEAN_IMAGE)
mu = mu.mean(1).mean(1)
transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))
transformer.set_mean("data", mu)
transformer.set_raw_scale("data", 255)
transformer.set_channel_swap("data", (2, 1, 0))

# set size of input (batch size is 50, even though I just classify 1 by 1)
net.blobs["data"].reshape(50, 3, 227, 227)

# run image through truncated network and write (4096,) image vector
fout = open(os.path.join(OUTPUT_FILE), 'wb')
for img_name in os.listdir(RESIZED_IMAGE_DIR):
    print img_name    
    image = caffe.io.load_image(os.path.join(RESIZED_IMAGE_DIR, img_name))
    transformed_image = transformer.preprocess("data", image)
    net.blobs["data"].data[...] = transformed_image
    output = net.forward()
    img_vec = output["fc7"][0]
    words = [(i, x) for i, x in enumerate(img_vec.tolist()) if x > 0.0]
    sorted_words = sorted(words, key=operator.itemgetter(1), reverse=True)
    words = ["%s|%.5f" % (i, x) for i, x in sorted_words]
    fout.write("%s\t%s\n" % (img_name, " ".join(words)))
fout.close()

The output for both the TL6 and TL7 runs is a file of image names and their associated vectors extracted from the respective truncated CNNs. In both cases, image vectors have shape (4096,).

Loading into the index and Evaluation


Since the data formats are identical, I can reuse the code for loading the index across all the approaches. I decided to store the data into the same index, under different schemas. So I needed to update my mapping section to explicitly list each schema, and the associated field definitions.

Once the data was in Elasticsearch, the following code was used to select 50 random images, and run it against each schema with varying number of features, and finally draw the chart shown earlier in the post.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
# Source: src/eval_feats.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
import elasticsearch
import matplotlib.pyplot as plt
import numpy as np
import os

import image_search_utils

IMAGE_DIR = "/path/to/data/for/butterflies"
OUTPUT_DIR = "/path/to/output"
ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"

# gather 50 random unique images from collection
np.random.seed(42)
test_image_ids = set(np.random.choice(200, 50, replace=False).tolist())
test_images = []
curr_idx = 0
for fname in os.listdir(IMAGE_DIR):
    if curr_idx in test_image_ids:
        test_images.append(fname)
    curr_idx += 1

# Run the evaluation
# for each doctype, I find the max number of non-zero features per doctype
# for our corpus and then cap the number of features. Then for each doc type
# I query with each test image varying the number of features and record the
# MRR for each result. MRR is averaged across the test set for each doc type
# and number of features combination and plotted
es = elasticsearch.Elasticsearch(hosts = [{
    "host": ES_HOST,
    "port": ES_PORT
}])
feature_sizes = {  # approx number of max non-zero entries per doc_type
    "rgb": 40, 
    "hsv": 40,
    "imf": 100,
    "tl6": 1000,
    "tl7": 1000
}
for doc_type in feature_sizes.keys():
    print("Evaluating %s" % (doc_type))
    fout = open(os.path.join(OUTPUT_DIR, "eval-%s.txt" % (doc_type)), 'wb')
    num_feats_list = np.linspace(0, feature_sizes[doc_type], 20).tolist()
    for num_feats in num_feats_list[1:]:
        num_feats = int(num_feats)
        print("... #-features: %d for doc_type: %s" % (num_feats, doc_type))
        result_mrr = 0
        for test_image in test_images:
            print("...... Querying with %s (#-feats: %d, doc_type: %s)" % 
                (test_image, num_feats, doc_type))
            qdata = image_search_utils.search_by_id(test_image, es,
                                                    ES_INDEXNAME, doc_type)
            # build vector for query image
            q_vec = []
            for ws in qdata[0]["imgsig"].split(" "):
                word, score = ws.split("|")
                q_vec.append((word, score))
            results = image_search_utils.search(q_vec, es, ES_INDEXNAME, 
                                                doc_type, num_feats)
            result_filenames = [result[0] for result in results]
            for rank, result_image in enumerate(result_filenames):
                if result_image == test_image:
                    result_mrr += (1.0 / (rank + 1))
                    break
        fout.write("%d\t%.5f\n" % (num_feats, result_mrr / len(test_images)))
    fout.close()

# Plot the results
doc_types = ["rgb", "hsv", "imf", "tl6", "tl7"]
for doc_type in doc_types:
    feval = open(os.path.join(OUTPUT_DIR, "eval-%s.txt" % (doc_type)), 'rb')
    xs, ys = [], []
    for line in feval:
        x, y = line.strip().split("\t")
        xs.append(x)
        ys.append(y)
    feval.close()
    plt.plot(range(len(xs)), ys, label=doc_type)
plt.xticks([])
plt.xlabel("Number of features")
plt.ylabel("Mean Reciprocal Rank (MRR)")
plt.grid(True)
plt.legend()

Finally, I needed to add one more function to look up an image by ID from the index instead of generating the vector for the query image using code. Here is the additional method.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def search_by_id(filename, es, index_name, doc_type):
    query = """
{
    "from": 0,
    "size": 1,
    "query": {
        "match": {
            "filename": "%s"
        }    
    }
}    
    """ % (filename)
    resp = es.search(index=index_name, doc_type=doc_type, body=query)
    hits = resp["hits"]["hits"]
    return [hit["_source"] for hit in hits]

Thats all I have for today, hope you enjoyed reading it. If you have suggestions about other vector spaces that have worked well for you with image search, would love to hear about them.


Sunday, May 29, 2016

Elasticsearch based Image Search using RGB Signatures


In my previous post, I described some experiments I was doing to reduce images to a Bag of Visual Words (BOVW). My goal is to build a Content Based Image Retrieval (CBIR), i.e., a system that searches images based on their pixel content rather than text captions or tags associated with them. Furthermore, I would like to use a standard text search engine to do this - a lot of effort has gone into making these engines robust and scalable. So if I model the image search as a text search over a BOVW and deploy it to one of these engines, I get the robust and scalable part for free.

Just like text search, image search is also a balance between precision and recall. You do want the "right" result to appear on top, but you also want to see other results "like" the one you asked for. My attempt to model this fuzziness is to bin the pixels into coarser buckets along each channel; that way this post-processed image looks a little more like other images (thus improving recall) but still looks more similar to similar images than dissimilar images (thus not impacting precision too much).

In my previous post, I had experimented with KMeans clustering for binning the pixels along each channel, but it turned out to be too intensive when run across my entire butterfly corpus of 200 images from Photorack. In any case, I read later that KMeans is not very effective in a single dimension, so I switched to binning the pixels along each channel into 25 equal sized bins. At the end of this process, each image becomes a document composed of a vocabulary of 75 unique "words". Here is the code to read the images and write the corresponding data for loading into a search index.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
# -*- coding: utf-8 -*-
# Source: es_build_rgb.py
import matplotlib.pyplot as plt
import os

import image_search_utils

INPUT_DIR = "../data/butterflies"
OUTPUT_FILE = "../data/butterflies_rgb.txt"

fout = open(OUTPUT_FILE, 'wb')
for fname in os.listdir(INPUT_DIR):
    print("Processing file: %s" % (fname))
    img = plt.imread(os.path.join(INPUT_DIR, fname))
    words = image_search_utils.get_vector_rgb(img)
    words_str = " ".join([w[0] + "|" + ("%.3f" % (w[1])) for w in words])
    fout.write("%s\t%s\n" % (fname, words_str))
fout.close()

For convenience, I factored some of the common functionality into a utils package. One of these is the functionality to convert an image of shape (640, 480, 3) into a vector as described above. We need to do the same transformation on an incoming image as well, so this is also called by the search code. The code for the package is shown below.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
# -*- coding: utf-8 -*-
# Source: image_search_utils.py
from __future__ import division, print_function
import collections
import numpy as np
import operator

def get_vector_rgb(img):
    """ Vectorizes an RGB image. Each channel's pixel intensities are
        binned separately into 25 equal size bins, and the frequencies
        counted. The end result is a vector of bin "pixel-words" that 
        are formed by one of ["r", "g", "b"] and the mid point of the
        bin. Each such pixel-word is associated with its frequency,
        arbitarily normalized to 100. Output is a list of (pixel-word,
        normalized-frequency) tuples ordered by highest to lowest 
        normalized frequency.
    """
    colors = ["r", "g", "b"]
    bins = np.linspace(0, 256, 25)
    means = 0.5 * (bins[1:] + bins[:-1]).astype("uint8")
    words = []
    for i in range(len(colors)):
        px_orig = img[:, :, i].flatten()
        labels = np.searchsorted(bins, px_orig)
        px_reduced = np.choose(labels.ravel(), means, 
                               mode="clip").astype("uint8").tolist()
        counter = collections.Counter(px_reduced)
        words.extend([(colors[i] + str(x[0]), x[1]) for x in counter.items()])
    words_sorted = sorted(words, key=operator.itemgetter(1), reverse=True)
    max_freq = words_sorted[0][1]
    words_sorted = [(x[0], 100.0 * x[1] / max_freq) for x in words_sorted]
    return words_sorted
    
def search(vec, es, index_name, doc_type, top_n=35, start=0, size=10):
    """ Does a payload search on the underly Elasticsearch index.
        The top N features of the image vector are used to search.
        The function query called implements cosine similarity so
        the output scores are between 1 and 0. Output is a list of
        file names and scores tuples, ordered by descending order 
        of score.
    """
    top_vec = vec[0:top_n]
    top_vec_as_text = " ".join([x[0] for x in top_vec])
    top_vec_as_param = ",".join(["\""+x[0]+"|"+str(x[1])+"\"" for x in top_vec])
    query = """
{
    "from": %d,
    "size": %d,
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "imgsig": "%s"
                }
            },
            "script_score": {
                "script": {
                    "lang": "groovy",
                    "file": "payload",
                    "params": {
                        "params": [ %s ]
                    }
                }
            }
        }
    }
}
    """ % (start, size, top_vec_as_text, top_vec_as_param)
    resp = es.search(index=index_name, doc_type=doc_type, body=query)
    hits = resp["hits"]["hits"]
    return [(hit["_source"]["filename"], hit["_score"]) for hit in hits]

The output of the vectorization step looks something like this. Each line represents a single image. The record consists of the file name followed by a tab, then followed by 75 pixel words and their frequencies separated by a pipe. Each pixel word starts with either r, g, or b to indicate the layer. The number following the character is the mid point of one of the bins. The frequency has been somewhat arbitarily normalized to a top value of 100. In retrospect I don't think I needed to do this.

1
2
3
4
5
1132868439-121.jpg     g122|100.000 r122|81.176 b122|71.557 r112|31.578 ...
1132868439-1210.jpg    r122|100.000 g122|94.113 b122|92.171 b37|24.449 ...
1132868439-12100.jpg   r122|100.000 g122|93.931 b122|92.702 b37|16.903 ...
1132868439-12101.jpg   r122|100.000 g122|96.838 b122|95.064 g26|34.409 ...
1132868439-12102.jpg   b122|100.000 g122|95.313 r122|94.820 r69|17.510 ...

For the index, I used Elasticsearch (ES) 2.3.3. Looking at the data format above, you probably guessed that I plan to use Lucene's Payloads feature. I have used Payloads before with Solr, but I am using ES more nowadays, so I figured it would be good to explore how to use ES for Payloads as well.

With the help of the OReilly ES Book (a freebie from Elasticon 2016), and some awesome advice on StackOverflow, I was finally able to get this data loaded. The code for creating the index schema and loading the data is shown below. If you look past the boilerplate, I am basically just declaring the analyzer chain for payload data, and declaring that I have two fields in my index - filename which is a string (not text), and imgsig which is a payload field. Then I read the file I just generated and write the records into the index.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
# -*- coding: utf-8 -*-
# Source: es_create_rgb.py + es_load_rgb.py
import elasticsearch

es = elasticsearch.Elasticsearch(hosts=[{
    "host": "localhost",
    "port": "9200"
}])
create_index = """
{
    "settings": {
        "analysis": {
            "analyzer": {
                "payloads": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "delimited_payload_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "rgb": {
            "properties": {
                "filename": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "imgsig": {
                    "type": "string",
                    "analyzer": "payloads",
                    "term_vector": "with_positions_offsets_payloads"
                }
            }
        }
    }
}
"""
resp = es.indices.create(index="butterflies", ignore=400, body=create_index)
print resp

fin = open("../data/butterflies_rgb.txt", 'rb')
line_nbr = 1
for line in fin:
    filename, imgsig = line.strip().split("\t")
    es.index(index="butterflies", doc_type="rgb", id=line_nbr, body={
        "filename": "%s" % (filename),
        "imgsig": "%s" % (imgsig)
    })
    line_nbr += 1
fin.close()

For querying the data, however, the advice on Stack Overflow did not work for me. Specifically, ES complained that it couldn't compile the Groovy script. I ended up moving the Groovy script out of the request and into the config/scripts directory of the ES server per the ES Scripting page docs. I also ended up modifying the script a little to make it emit cosine similarities scores instead of the sum of payload scores of the matched result as it was doing before. My initial objective was to make the score reflect the importance of the query image vector elements as well, but then I realized I could just normalize the number to get the cosine similarity. Here is the Groovy script for payload scoring. Some of the functions used are explained in more detail on the ES Advanced Scripting docs.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
numer = 0.0;
denom_l = 0.0;
denom_r = 0.0;
for (param in params) {
    def (word, weight) = param.tokenize('|');
    weight_l = weight.toFloat();
    denom_l = denom_l + (weight_l * weight_l);
    termInfo = _index["imgsig"].get(word, _PAYLOADS);
    for (pos in termInfo) {
        weight_r = pos.payloadAsFloat(0);
        numer = numer + (weight_l * weight_r);
        denom_r = denom_r + (weight_r * weight_r);
    }
}
return numer / Math.sqrt(denom_l * denom_r);

The image_search_utils package above contains the query to call this scoring function. A typical query looks like this. The query part matches the records which have the words r80 and r90 together, and the params provide the source words and frequencies to match with. Remember that each image can have a maximum of 75 features and are ordered by importance. So we can increase precision by increasing the number of features in our query image and increase recall by decreasing the number of features.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
{
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "imgsig": "r80 r90"
                }
            },
            "script_score": {
                "script": {
                    "lang": "groovy",
                    "file": "payload",
                    "params": {
                        "params": [ "r80|10.0", "r90|10.0" ]
                    }
                }
            }
        }
    }
}

For data exploration, I built a little web application with CherryPy. The main page shows me thumbnails of all the images in my corpus. Clicking on an image does a search using that image as the query. Here is the code for the web application.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
# -*- coding: utf-8 -*-
# Source: image_search_web.py
from __future__ import print_function
import cherrypy
import elasticsearch
import matplotlib.pyplot as plt
import os

import image_search_utils

IMAGE_DIR = "/full/path/to/image-search/data/butterflies"
ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"
ES_DOCTYPE = "rgb"

class ImageSearchService(object):
    
    def __init__(self):
        self.img_fnames = self._load_images()
        self.es = elasticsearch.Elasticsearch(hosts=[{
            "host": ES_HOST,
            "port": ES_PORT
        }])
        self.index_name = ES_INDEXNAME
        self.doc_type = ES_DOCTYPE

    def _load_images(self):
        img_fnames = []
        for fname in os.listdir(IMAGE_DIR):        
            img_fnames.append(fname)
        return img_fnames

    @cherrypy.expose
    def index(self):
        html = ("""<table cellspacing="0" cellpadding="0" border="1" width="100%">""")
        for i in range(20):
            html += ("""<tr>""")
            for j in range(10):
                curr_fname = self.img_fnames[i * 10 + j]
                html += ("""<td><a href="/search?q=%s"><img src="images/%s" width="100" height="75"/></a></td>""" % 
                    (curr_fname, curr_fname))
            html += ("""</tr>""")
        html += ("""</table>""")
        return html

    @cherrypy.expose    
    def search(self, q):
        img = plt.imread(os.path.join(IMAGE_DIR, q))
        term_vec = image_search_utils.get_vector_rgb(img)
        results = image_search_utils.search(term_vec, self.es, self.index_name, self.doc_type)
        html = """<h3>Query:</h3><br/>"""
        html += """<b>Filename:</b> %s<br/>""" % (q)
        html += """<b>Image:</b><br/>"""
        html += ("""<img src="images/%s" height="75" width="100"/><br/><hr/>"""
            % (q))
        html += """<h3>Results</h3><hr/>"""
        html += """<table cellspacing="0" cellpadding="0" border="1" width="100%">"""
        html += """<tr><th>Rank</th><th>Filename</th><th>Image</th><th>Score</th></tr>"""
        rank = 1
        for result in results:
            html += """<tr valign="top">"""
            html += """<td align="center">%d</td>""" % (rank)
            html += """<td align="center">%s</td>""" % (result[0])
            html += """<td align="center"><img src="images/%s" height="75" width="100"/></td>""" % (result[0])
            html += """<td align="center">%.5f</td>""" % (result[1])
            html += """</tr>"""
            rank += 1
        html += """</table>"""
        return html
    
if __name__ == "__main__":
    cherrypy.config.update({
        "server.socket_host": "127.0.0.1",
        "server.socket_port": 8080
    })
    conf = {
        "/images": {
            "tools.staticdir.on": True,
            "tools.staticdir.dir": IMAGE_DIR
        }
    }
    cherrypy.quickstart(ImageSearchService(), config=conf)

The screenshots below show the search query page (also the home page) and a results page. Notice that the query image and first result image on the results page are identical. This is true for most searches I ran, although in a few cases the query result appeared in the second position, and in fewer cases, at lower positions.






In order to figure out how good the search was overall, I ran an evaluation to measure the change in Mean Reciprocal Rank (MRR), for a random set of 50 images, as I varied the number of query image features. Here is the code to do the evaluation.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
# -*- coding: utf-8 -*-
# Source: eval_feats_rgb.py
from __future__ import division, print_function
import elasticsearch
import matplotlib.pyplot as plt
import numpy as np
import os

import image_search_utils

IMAGE_DIR = "../data/butterflies"
ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"
ES_DOCTYPE = "rgb"

# gather 50 random unique images from collection
np.random.seed(42)
test_image_ids = set(np.random.choice(200, 50, replace=False).tolist())
test_images = []
curr_idx = 0
for fname in os.listdir(IMAGE_DIR):
    if curr_idx in test_image_ids:
        test_images.append(fname)
    curr_idx += 1

es = elasticsearch.Elasticsearch(hosts = [{
    "host": ES_HOST,
    "port": ES_PORT
}])
mrrs = []
top_ns = range(0, 80, 5)
top_ns[0] = 1
for top_n in top_ns:
    print("Now running with top %d features..." % (top_n))
    mrr = 0.0
    for test_image in test_images:
        print("... querying with %s" % (test_image))
        img = plt.imread(os.path.join(IMAGE_DIR, test_image))
        img_vec = image_search_utils.get_vector_rgb(img)
        results = image_search_utils.search(img_vec, es, ES_INDEXNAME, 
                                            ES_DOCTYPE, top_n)
        result_images = [result[0] for result in results]                                            
        result_mrr = 0.1        
        for rank in range(len(result_images)):
            if result_images[rank] == test_image:
                result_mrr = 1.0 / (rank + 1)
                break
        mrr += result_mrr
    mrrs.append(mrr /  len(test_images))

plt.plot(top_ns, mrrs)
plt.xlabel("Number of features (Top N)")
plt.ylabel("Mean Reciprocal Rank (MRR)")
plt.grid(True)
plt.show()        

The output of this code is the chart below, which shows that we get the best MRR overall with around 35 features, and plateues thereafter. The MRR at the platueue is around 0.8, so most of our query images are returned within the top 2 ositions in the search results, which I think is quite good for such a simple model.


The flip side of this analysis is to see how the scores decrease by position. Understanding this behavior would enable us to set a threshold for "good" matches. Keeping the top_N parameter set to 35, we run a small subset of images and chart their scores by position. The code to do this is shown below.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
# -*- coding: utf-8 -*-
# Source: eval_scores_rgb.py
from __future__ import division, print_function
import elasticsearch
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os

import image_search_utils

IMAGE_DIR = "../data/butterflies"
ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"
ES_DOCTYPE = "rgb"
NUM_TESTS = 5

# gather 5 random unique images from collection
np.random.seed(42)
test_image_ids = set(np.random.choice(200, NUM_TESTS, replace=False).tolist())
test_images = []
curr_idx = 0
for fname in os.listdir(IMAGE_DIR):
    if curr_idx in test_image_ids:
        test_images.append(fname)
    curr_idx += 1

es = elasticsearch.Elasticsearch(hosts = [{
    "host": ES_HOST,
    "port": ES_PORT
}])

cmap = matplotlib.cm.get_cmap("Spectral")
color = np.linspace(0, 1, NUM_TESTS)

for i in range(NUM_TESTS):
    img = plt.imread(os.path.join(IMAGE_DIR, test_images[i]))
    img_vec = image_search_utils.get_vector_rgb(img)
    results = image_search_utils.search(img_vec, es, ES_INDEXNAME, ES_DOCTYPE,
                                        top_n=35, start=0, size=50)
    scores = [result[1] for result in results]
    plt.plot(range(len(scores)+1)[1:], scores, color=cmap(color[i]))   
plt.axhline(0.9, 0, 50, color='r', linewidth=1.5, linestyle="dotted")
plt.xlabel("Result Position (Rank)")
plt.ylabel("Score")
plt.grid(True)
plt.show()

And the chart generated by the above program is as follows. As can be seen, all the queries result in 5-10 images above the 0.9 mark (marked by the red dotted line), which might be a good threshold to use in this case if we were trying to find very similar images.


Thats all I have for today. In this post, I described about how ES handles Payloads and function queries. Looking back with the work I did with Solr payloads, I found the ES implementation quite intuitive and easy to use. I was also able to build an index where I store pre-built vectors (as opposed to have the index create vectors out of the text) and compute cosine similarities using function queries. I am actually quite pleasantly surprised at how well the functionality is working. In coming weeks, I plan on making some improvements, will have more to share once I do that.