Leveraging Lucene


Imagine a catalog of a few hundred thousand items. These items have been labeled into a few hundred categories. Each item can be linked to up to three categories. New categories need to be added in order to make things easier to find. However, categories without content are useless. So some content needs to be linked to the new categories. Luckily both the items and the categories have a description, and that makes things easier.

The idea is simple:

  • Put all items into a searchable index.
  • For each category, find out what the most important words are.
  • Create a search term using these most important words, by just sticking them together.
  • Search the index of items for best matches.

And we’re done, sort of. It’s a bit more complicated than that, but that’s mostly because of finetuning.

Building the searchable index

This is where Apache Lucene comes in. Lucene is an open source full text indexing and search library, supported by the Apache Software Foundation.
First released in 1999, it is still in active development.

To create an index, you need to create an IndexWriter, and use it to add Documents to the index.

public IndexWriter createIndexWriter() throws IOException {
        Directory indexDir = FSDirectory.open(Paths.get(INDEX_DIR));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig icw = new IndexWriterConfig(analyzer);
        icw.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        return new IndexWriter(indexDir, icw);
    }

Only one writer is needed for adding items to the index. Note that IndexWriterConfig.OpenMode.CREATE will create a new index.
If there is anything already in the index, it will be removed. IndexWriterConfig.OpenMode.CREATE_OR_APPEND could be used if you want to add to an existing index.

Next up is actually adding things to the index. Each item in the index is called a Document. Documents have Fields that can be used for searching.

Creating and adding a document can be done like this:

try {
	Document document = new Document();
	document.add(new StringField("url", "http://ghyze.nl", Field.Store.YES));
	document.add(new TextField("title", "My awesome blog",  Field.Store.YES));
	document.add(new TextField("description", "Blogging about things",  Field.Store.YES));

	writer.addDocument(document);
} catch (IOException e){
	e.printStackTrace();
}

When we’re done with adding the documents, we need to close the IndexWriter:

writer.close();

Find important words

Let’s first define what an “important word” is. Important words are words that are the most relevant for each document in the collection.
There is a nice algorithm to determine the relevance of each word: tf-idf (short for term frequency–inverse document frequency).

The premise of this algorithm is that a word is more relevant for a document the more it appears in that document, and less relevant for a single document when it appears in more documents.

  • For each document, we count how many times each word appears and we divide that by the total number of words in this document. This is the term-frequency part.
  • For each word in every document, take the total number of documents and divide it by the number of documents containing this word. We don’t want this number to be too large, so we take the log of this. This is the inverse document frequency part.
  • Multiply these numbers to get the relative importance of each word for every document. A Higher score means the word is more important.

The code for this algorithm consists of two classes and a value object

De first class represents a single document in the collection. It is responsible for calculating the importance of the words it contains in relation to the words of all other documents.

import lombok.Getter;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Document {

    /**
     * The identifier for this document
     */
    @Getter
    private final String id;

    /**
     * The complete text for this document
     */
    @Getter
    private final Collection<String> lines;

    /**
     * Every word in this document, with the number of times it appears
     */
    private Map<String, Integer> wordsInDocument;

    /**
     * Constructor
     */
    public Document(String id, Collection<String> lines){
        this.id = id;
        this.lines = lines;
    }

    /**
     * Get the map with the unique words in this document, and the number of times they appear
     */
    public Map<String, Integer> getWordsInDocument(){
        if (wordsInDocument == null){
            calculateWordMap();
        }
        return wordsInDocument;
    }

    /**
     * Calculate the number of times each unique word appears in this document
     */
    private void calculateWordMap(){
        wordsInDocument = new HashMap<>();
        for (String line : lines){
            String[] words = line.split("\\s");
            for (String word : words){
                if (word.trim().length() > 1) {
                    Integer wordCount = wordsInDocument.getOrDefault(word, Integer.valueOf(0));
                    wordsInDocument.put(word, wordCount+1);
                }
            }
        }
    }

    /**
     * Calculate the importance of each word, compared to other words in this document
     * and all other documents in the index
     * @param index The collection of documents that also contains this document.
     * @return An ordered list indicating the importance of each word in this document.
     */
    public List<WordImportance> calculateWordImportance(WordIndex index){
        List<WordImportance> wordImportance = new ArrayList<>();
        double totalWordsInDocument = getWordsInDocument().values().stream().mapToInt(Integer::intValue).sum();
        double totalNumberOfDocuments = index.getNumberOfDocuments();
        for (String word : getWordsInDocument().keySet()){
            double tf = ((double) getWordsInDocument().get(word)) / totalWordsInDocument;
            double idf = Math.log(totalNumberOfDocuments / ((double) index.getNumberOfDocumentsContaining(word)));
            wordImportance.add(new WordImportance(word, tf*idf));
        }

        // most important word first
        wordImportance.sort(Comparator.comparing(WordImportance::getImportance).reversed());

        return wordImportance;
    }
}

The next class represents the collection of all documents. It is responsible for calculating for each word the number of documents that contain it.

import lombok.Getter;

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WordIndex {

    /**
     * The index, key is the identifier of the document
     */
    @Getter
    private Map<String, Document> index = new HashMap<>();

    /**
     * A map with all the words in all the documents, and the number of documents containing those words
     */
    private Map<String, Integer> documentCountForWords = null;

    /**
     * Constructor
     */
    public WordIndex(){
    }

    /**
     * Add a document to the index, overwriting when it already exists.
     * @param document the document to add
     */
    public void addDocument(Document document) {
        if (index.containsKey(document.getId())) {
            System.out.println("Overwriting document with ID "+ document.getId());
        }

        index.put(document.getId(), document);
    }

    /**
     * Get all words in all documents. If a word appears multiple times, it is only returned once.
     * @return All words in this document
     */
    public Collection<String> getAllWords(){
        Set<String> allWords = new HashSet<>();

        index.values()
                .forEach(e -> allWords.addAll(e.getWordsInDocument().keySet()));

        return allWords;
    }

    /**
     * Get the map with number of documents per word
     * @return the map with number of documents per word
     */
    public Map<String, Integer> getDocumentCountForWords(){
        if (documentCountForWords == null){
            calculateDocumentCountForAllWords();
        }
        return documentCountForWords;
    }

    /**
     * Iterate over every word in every document, and count the number of documents that word appears in.
     */
    private void calculateDocumentCountForAllWords(){
        Collection<String> allWords = getAllWords();
        documentCountForWords = new HashMap<>();
        for (String word : allWords){
            for (String documentId : index.keySet()){
                Map<String, Integer> document = index.get(documentId).getWordsInDocument();
                if (document.keySet().stream().anyMatch(e -> e.equals(word))){
                    Integer count = documentCountForWords.getOrDefault(word, 0);
                    documentCountForWords.put(word, count+1);
                }
            }
        }
    }

    /**
     * Get the total number of documents
     * @return
     */
    public int getNumberOfDocuments(){
        return index.size();
    }

    /**
     * Get the number of documents this word appears in.
     * @param word The word we're interested in
     * @return The number of documents containing this word
     */
    public int getNumberOfDocumentsContaining(String word){
        Map<String, Integer> wordCount = getDocumentCountForWords();
        return wordCount.getOrDefault(word, 0);
    }
}

Then we have a value object, that is used by the document to indicate the relative importance of each word.

import lombok.AllArgsConstructor;
import lombok.Value;

@Value
@AllArgsConstructor
public class WordImportance {
    String word;
    double importance;
}

The easy steps

Now it’s time to search for content for the new categories. To do this, we take most important words of the new categories and string them together, separated by spaces. Then we use that search term to search the index, and extract the result.

This is what the code for performing the search would look like:

/** 
 * Prepare the search engine
 */
public void initializeSearch(){
	try {
		indexReader = DirectoryReader.open(FSDirectory.open(Paths.get(BuildStudyIndex.INDEX_DIR)));
		searcher = new IndexSearcher(indexReader);
		analyzer = new StandardAnalyzer();

		queryParser = new MultiFieldQueryParser(new String[]{"title", "shortDescription", "longDescription"}, analyzer);
	} catch (IOException e){
		e.printStackTrace();
	}
}

/**
 * Perform search
 */
public TopDocs search(Collection<String> words) throws IOException, ParseException{
	String searchTerm = String.join(" ", words);
	Query query = queryParser.parse(searchTerm);
	TopDocs results = searcher.search(query, 200000);
	return results;
}

The code to extract the search results would be something like this:

TopDocs result = search(searchTerms);
for (ScoreDoc hit : result.scoreDocs){
	Document found = searcher.doc(hit.doc);
	double score = hit.score;
}

Conclusion

There are some steps that we needed to do that I haven’t mentioned. But those are mainly plumbing and finetuning.

We have seen how to use Apache Lucene as a custom search engine. First we’ve built a searchable index, and later we have searched that index for relevant items.
We have also seen how to implement an algorithm that determines the most relevant words in a specific document, compared to the other documents in a collection.

The reason this works is that words have meaning. I know, stating the obvious. Each word gives meaning to the text, and this meaning has varying degrees of relevance to that text. The words that are most relevant to the text distinguish the meaning of the text from the other texts in the collection. We don’t need to know the actual meaning of the text, we just need to separate it from all other texts. Then, through the magic of search engines, we can match the texts that have the most similar meanings.


Geef een reactie

Je e-mailadres wordt niet gepubliceerd. Vereiste velden zijn gemarkeerd met *