Felipo: My own PHP framework

August 21, 2012

In my opinion, when you have built a project that you like and you don't want to let it die, forgotten on some server in Slovakia, you open source it. That is exactly what I did with Felipo (editor's note: Felipo is the name of the author's cat).

https://github.com/espinaemmanuel/Felipo

If you see some similarities with other frameworks, you are right: Felipo is strongly inspired by Ruby on Rails, but written in PHP. Anyone interested in extending it is welcome. Just send me a mail, and if you want to contribute your changes back, create a pull request on GitHub. It is licensed under the Apache License, so there are few restrictions on what you can do with it.

I will very briefly summarize some of its characteristics. You will notice that some parts are in Spanish and some in English. That's because it started as a project in Spanish, not as a framework, and over time it evolved into an independent piece of software. The language for new additions will always be English.

And finally, you'll notice the lack of unit tests in this project. That is bad, and I know it. Again, anyone can add them now that it is open source.

Front controller architecture based on modules

The architecture is basically the same as in many web frameworks and is based on the design proposed by Martin Fowler. The idea, essentially, is to separate the views from the controllers. All requests are redirected via mod_rewrite to the same PHP script (front_controller.php), which parses the request and, based on the URI string, dispatches it to the right controller. The dispatch follows a convention in the URL:

http://<base_domain>/module/controller/action?params

The system is organized in modules. For example, one module can be the admin interface and another the front end. Each module corresponds to a directory in the apps directory. In the pageControllers directory inside each module you find the controllers, which are classes; the actions are methods of those classes.

Multiple configuration environments

You can define a configuration for production, development, testing, etc. An environment is selected based on a set of rules: currently there is a default one, and another that is chosen based on the HTTP “Host” request header (the assumption is that the production environment will see the real domain in this header, while dev will see “localhost”). Based on that, the right config file is loaded.

Extensible via plugins

The initial idea was to make this system easy to extend. A plugin is a folder in the include directory. Among the plugins currently in the system, it's worth noting the database plugin, which implements the Active Record pattern, and the REST plugin, which adds controllers to expose the Active Records via a REST interface.

The plugins are loaded selectively according to the $config['plugins'] values in the configuration file.

Database connections via Active Record

Felipo implements a very lightweight version of active record. For example, to save a person to the database:

class Person extends ActiveRecord {}
$person = new Person();
$person->id = 123;
$person->name = "Emmanuel";
$person->lastName = "Espina";
$person->save();

This will execute in the database:

INSERT INTO Person (id, name, lastName) VALUES (123, 'Emmanuel', 'Espina');

As you can see it is very simple (in simple cases). To load the person you do:

$person = Person::loadById(123);

The active records go in the models directory in each module.

Easy REST resources

Now that you have a Person represented as an active record, you can expose it to the world with an ActiveRecordResource:

class Person extends ActiveRecordResource {}

That goes in the resources directory. Currently, for this to work, you must have a corresponding active record in the models directory with the name Person_AR (I'll fix this in the future).
Now, if you want to retrieve the person, you create a REST controller inheriting from RestControllerGenerico

And send it a request like:

http://<base_domain>/module/rest/index/Person/123

And you will get the person formatted as a JSON object.
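For illustration, a client could consume that resource like this. This is only a sketch: the base domain, the module name (“frontend”) and the field names are assumptions, not part of Felipo itself.

from urllib import urlopen
import json

# Hypothetical installation; adjust the base domain and module name.
url = "http://localhost/frontend/rest/index/Person/123"
person = json.loads(urlopen(url).read())
print person["name"], person["lastName"]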

What else?

To keep this post short I didn't include elements like Authenticators (there are LDAP and MySQL based authentication modules). Login and session management are also already included as another plugin.

Finally, there is a set of validators and HTML form generators that take a form specification (as a JSON object), create the HTML, and can then build validators at run time to check that the input passes simple validations.

You can investigate all of these features by reading the code (and documenting it if you want :) )

The interesting thing about this framework is that it is small, and one of the main design decisions was to make it fully modular through plugins. Almost everything is a plugin, even the database connectivity. This should keep the system simple enough for anyone to understand and extend it relatively easily.

Multivalued geolocation fields in Solr

September 30, 2011

Today we’ll see a small workaround that can be used to accomplish a very common use case regarding geographic search.

The example use case is the following

The client is located in Buenos Aires and wants a purple tie

He enters the query “purple tie” in the search box of our store's web page

The system returns the ties that can be purchased in the stores near Buenos Aires, and then the ones in our stores in Montevideo (i.e., ties that can be found in nearby stores)

That is, the system does not return ties from our stores in Spain, because nobody would travel to Spain to buy a tie, nor order one across the Atlantic Ocean (no tie is that special). This is part of the first and probably only problem in search systems: return only relevant results. In this case, relevant considering their location.

In theory, the problem could be easily solved with a multivalued coordinate field, where each tie (or product in our system) would have a list of the coordinates where it can be found. We would then filter our products within a circle centered on Buenos Aires to get the nearby ones.

This works fine if the product is available in only one store, but when it is in multiple locations simultaneously (i.e., there is a list of coordinates in the field, not a single one) a problem arises: Solr does not allow filtering on multivalued coordinate fields.

But not everything is lost, and in this post I’ll propose a workaround to solve this issue.

We are going to create another index containing only the stores, with their id and location. Using C-style pseudocode, we can consider this as a document in our “stores index”:

struct {
latlon location;
int storeID;
} store;

Having this other index (another Solr Core for example) we are going to split our query in two:

1. First you query the “stores index”: you perform a geographic filter query (using Solr's spatial functionality) and get a list of the ids of the stores near the central point you specified. You haven't used the text the user entered yet in this phase, only the user's location, obtained by some means.

q=*:*&fl=storeID&fq={!geofilt pt=-34.60,-58.37 sfield=location d=5}

And you obtain a list of store ids

id=34, id=56, id=77

2. In the second phase you perform the search in the regular way (now using the query text the user entered), but adding a filter query (a regular one, not a geographic one) in the following fashion:

q=”product brand x”&fq=(storeId:34 OR storeId:56 OR storeId:77)

where the store ids are the ones returned by the first query and, consequently, are the ones near the central point.

In this case you are restricting the results to those near the user. Alternatively, you can boost the nearest results while still showing the distant ones in later pages of the results.

To summarize, we use two queries with regular functionality to get the advanced functionality you desire. The first query gets the stores near the zone, and the second is the actual query.
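A rough sketch of the two-query approach in Python follows. The host, core names and row limits here are assumptions; adapt them to your own setup.

from urllib import urlopen, quote
import json

SOLR = "http://localhost:8983/solr"   # assumed host; "stores" and "products" are assumed core names

def nearby_store_ids(lat, lon, km):
    # First query: geofilt on the stores index to get the ids of nearby stores
    fq = quote("{!geofilt pt=%s,%s sfield=location d=%s}" % (lat, lon, km))
    url = "%s/stores/select?q=*:*&fl=storeID&rows=1000&wt=json&fq=%s" % (SOLR, fq)
    docs = json.loads(urlopen(url).read())["response"]["docs"]
    return [doc["storeID"] for doc in docs]

def search_nearby(text, lat, lon, km=5):
    # Second query: the user's text on the products index, filtered by those store ids
    ids = nearby_store_ids(lat, lon, km)
    if not ids:
        return []
    fq = quote(" OR ".join("storeId:%s" % i for i in ids))
    url = "%s/products/select?q=%s&wt=json&fq=%s" % (SOLR, quote(text), fq)
    return json.loads(urlopen(url).read())["response"]["docs"]

print search_nearby("purple tie", -34.60, -58.37)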

There is a patch in development, SOLR-2155, that accomplishes the same functionality, but it has not been committed yet. Meanwhile, this is a good example of what you can do with multiple queries to Solr.

Enhancing search results using machine learning

August 23, 2011

Today's post is about enhancing search results through the use of machine learning. The principle is simple: if you have a very popular site, you can use the behavior of your users to improve the search quality for new users.

Well, I don't have a very popular site :( So this post will be about a theoretical idea that still has to be tested. If you want to test the solution I'm about to describe, please contact me (or leave a comment). That is what I'm missing to be able to write a serious paper (or at least a serious post).

Introduction

To introduce the topic, let's think about how users work with “information retrieval platforms” (I mean, search engines). The user enters your site, sees a little rectangular box with a button that reads “search” beside it, and figures out that he must think of some keywords to describe what he wants, type them in the search box and hit search. Although we are all very used to this, a closer look at the procedure shows that it is quite unintuitive. Before search engines, “mentally extracting keywords” from concepts was not a common activity.

It is natural to categorize things, to classify ideas or concepts, but extracting keywords is a different intellectual activity. While searching, the user must think like the search engine! The user must think: “well, this machine will give me documents with the words I am going to enter, so which words have the best chance of giving me what I want?”

Wouldn't it be great to search by concept? For example, the user wants documents from World War II about the Normandy landings. He enters “Normandy Landings” and the search engine figures out that he is looking for documents about WWII, more precisely about the Normandy landings, not just documents containing the words “Normandy” and “Landings”.

The problem

Well readers, the problem that I just described already exists, and it's called Web Query Classification. You can read the full idea on Wikipedia.

What I propose here is to use a simple yet powerful machine learning technique (Naive Bayes) to extract categories from the user's query and enhance result quality, using query logs to train the model.

But what do I mean by categories? The term is intentionally ambiguous; it applies to a broad range of problems. In particular, I came up with this idea from a client's problem. This client wanted to improve their results by extracting brand names and product types from the query string. So if the person entered “cherry coke”, they wanted to show the results for “Coca Cola Cherry”. Having extracted the brand names (if any) from the query, they wanted to use them (in some way) to improve (boost or filter) the results. So for “cherry coke” they would extract “Coca Cola” as the brand, and from among all the Coca Cola products they would search for “cherry” or “coke” (Coca Cola Cherry is then likely to get a high ranking in the results).

The possible solution

The web server logs are a good source for training the Naive Bayes algorithm. The idea is similar to the one in the paper Probabilistic Query Expansion Using Query Logs by Hang Cui et al.: they store the query that the user entered and the document that the user clicked afterwards.

session := <query text> [clicked document]*

The difference I'm proposing here is not to store the clicked document but its associated categories (or “ideas”, “concepts”, etc). The log will be composed of queries and the categories of the documents that the user clicked after the search. For example, some entries in the log could be:

“world war” => [World War II]
“world war II” => [World War II]
“Normandy landings” => [World War II]
“Germany 1945” => [World War II]
“Germany 1945” => [World War II]
“germany 1945” => [German Cinema]

[World War II] is just a tag name. It won't be tokenized or analyzed in any way, as it is just a representation of an idea or concept. “Normandy landings” refers to the concept of [World War II], and “Germany 1945” also refers to the same concept.

Note the last entry in the example. In this case the user clicked an article about Wim Wenders, a German film director born in 1945. This is an example of a non-representative result. If you say “Germany 1945”, most people will think about the war. However, somebody looking for a German film director born in 1945 whose name he doesn't recall could enter “Germany 1945” to get to the page of Wim Wenders. Searching for “germany 1945 movies” would improve his chances of finding Wenders. Our system should extract two “ideas” from this text, [World War II] and [German Cinema], and the results should contain mainly German propaganda films from that period, because that is what a human would most likely be looking for with that text.

To summarize, we are trying to get the categories with the highest chance.

How do we create this log? Easy: each time the user clicks on a document we check whether he had just performed a search. If he has, we extract the categories from the document and store a log entry in the previously described format. A question that naturally arises is how to extract the categories from a document. That is beyond the scope of this post; I assume that somebody has read all the documents we have and has manually assigned categories to them. The category (or idea, or brand, or concept) is just another field in our document database, available from the very beginning of our experiment.

Naive Bayes

To get the categories with the greatest chance, we have to deal with probabilities. I'll give you a super fast description of the Naive Bayes algorithm; I used it in a previous post, but now I'll briefly explain what it does.

Suppose you have a text and you want to determine which category it belongs to. Since you don't know the “true” category of the text, you want the category with the greatest chance from among all the possible categories (or the three most probable ones, for instance). Written in math language, that is the following formula:

max P(cat | germany 1945 )

That is, given that the user entered “germany 1945”, we get, for each category, the probability of it being associated with that text. We take the category with the highest probability and return it.

The problem is easier to solve by flipping the condition by means of Bayes' theorem:

P(<cat> | germany 1945) = \frac{P( germany 1945 | <cat>)P(<cat>)}{P(germany 1945)}

To maximize this, the only parameter we are allowed to vary is the category (the text is given and we cannot modify it). Since the denominator P(germany 1945) is the same for every category, we can omit it to simplify the problem:

max P( germany 1945 | <cat>)P(<cat>)

There are two parts here. First, some categories are more probable than others: think about the categories <Quantum Electrodynamics> and <Sex> and you'll immediately recognize a difference in popularity. That prior probability P(<cat>) can also be estimated from the log.

For the first factor we need to consider the individual terms: not “germany 1945” but “germany” and “1945”. To calculate P(“germany” ∩ “1945” | <cat>) we make the “naive” assumption of Naive Bayes: all the terms in the query string are conditionally independent given the category. That means that “germany” is treated as independent of “1945”. We know this is not true, but experience has shown that the assumption gives very good results. With it, the formula becomes:

P( germany | <cat>)P( 1945 | <cat>)P(<cat>)

I won't go much further, but there is an issue when a word has never appeared together with a category. For example, for the query “World War II Bananas” the probability P(bananas | [WWII]) would be 0 and the entire product would be 0, even though the text, odd as it is, is clearly related to [WWII]. The usual fix is to add 1 to every count, so that no conditional probability is ever exactly zero:

P(w | <cat>) = \frac{count(w, <cat>) + 1}{count(<cat>) + |V|}

where count(<cat>) is the total number of word occurrences logged for that category and |V| is the vocabulary size. The final calculation is the product of the smoothed P(germany | <cat>), the smoothed P(1945 | <cat>), and P(<cat>).

This is called Laplace smoothing or additive smoothing.
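Before handing the real work to Mahout (next section), here is a toy Python version of that calculation, using the tiny log from above as training data, just to make the scoring concrete:

from collections import defaultdict

# Toy version of the query log described above: (query, clicked category).
log = [
    ("world war", "World_War_II"),
    ("world war II", "World_War_II"),
    ("Normandy landings", "World_War_II"),
    ("Germany 1945", "World_War_II"),
    ("Germany 1945", "World_War_II"),
    ("germany 1945", "German_Cinema"),
]

cat_count = defaultdict(int)                         # entries per category
term_count = defaultdict(lambda: defaultdict(int))   # per-category word counts

for query, cat in log:
    cat_count[cat] += 1
    for term in query.lower().split():
        term_count[cat][term] += 1

vocabulary = set(t for counts in term_count.values() for t in counts)

def score(query, cat):
    # P(cat) times the product of the Laplace-smoothed P(term | cat)
    s = float(cat_count[cat]) / sum(cat_count.values())
    total_terms = sum(term_count[cat].values())
    for term in query.lower().split():
        s *= float(term_count[cat][term] + 1) / (total_terms + len(vocabulary))
    return s

for cat in cat_count:
    print cat, score("germany 1945", cat)

Running it, [World_War_II] comes out well ahead of [German_Cinema] for “germany 1945”, which is exactly the behavior we want.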

Implementation

We are not going to implement all that! We are going to use Mahout (again). We can recycle some of the code from the previous post about classifying mail for this.

We need a script to pre-process the query log into the format that the Naive Bayes trainer accepts.

That is, to get something like:

World_War_II normandy landings
World_War_II germany 1945
World_War_II germany 1945
German_Cinema germany 1945

This is very similar to the text input from the previous post (the category must be expressed as a single word). After training the model we can use exactly the same program to create an online classifier of queries.
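A minimal pre-processing sketch could look like the following. The input format is an assumption (one query and one category per line, separated by a tab); the output is the single-token-label format shown above.

import sys

# Reads "query<TAB>category" lines from stdin and writes
# "category term term ..." lines for the trainer.
for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        continue
    query, category = parts
    label = category.strip().replace(" ", "_")   # the label must be a single token
    terms = query.lower().split()
    if terms:
        print label + " " + " ".join(terms)

You would run it as python prepare_log.py < query_log.tsv > prepared-train (the file names are just placeholders).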

Then, assuming that we are using Solr as the search engine, we can add boost queries to the regular query. The process would be:

The user enters “Germany 1945 movies”. The system queries our “online query classifier” and gets two possible categories: [WWII] and [German films]. Now we submit a query to Solr in the following fashion:

http://solrserver/solr/dismax?q=Germany+1945+movies&bq=category:WWII&bq=category:german_films

The category classifier can give us a score (extracted from the Naive Bayes calculation) which, with some application logic, can be used to boost the categories. For example, if we got [WWII] with a high score, we can give it a higher boost in the query:

q=Germany+1945+movies&bq=category:WWII^3
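The application logic that turns classifier output into boost queries can be very small. This is only an illustration: the cut-off and the scaling of the score into a boost factor are made-up values.

from urllib import urlencode

def build_solr_url(user_query, categories):
    # categories: list of (category, score) pairs returned by the classifier
    params = [("q", user_query), ("defType", "dismax")]
    for category, score in categories:
        if score < 0.1:                    # arbitrary cut-off
            continue
        boost = 1 + int(score * 5)         # arbitrary mapping from score to boost
        params.append(("bq", "category:%s^%d" % (category, boost)))
    return "http://solrserver/solr/select?" + urlencode(params)

print build_solr_url("Germany 1945 movies", [("WWII", 0.8), ("german_films", 0.3)])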

Conclusions

There is little to say here, since I don't have a data set to test this blackboard experiment, so I cannot say much more than that it is a very elegant idea. As I said before, it is not a very original one either; it has been studied before, and a search on the Internet will give you plenty of results. The interesting thing is that everything we need to apply this at home is already available as open source software. Both Solr and Mahout are open source projects, and with them you can build a learning search engine in a couple of minutes.

Returning to the dataset, I would be glad if somebody shared some query logs. These things work better with a lot of queries, and the best example I could find is http://www.sigkdd.org/kdd2005/kddcup.html, a data mining competition with 800 classified queries. That is not enough for our purposes.


Ham, spam and elephants (or how to build a spam filter server with Mahout)

April 26, 2011

Introduction

Something quite interesting has happened with Lucene. It started as a library; then its developers began building new projects on top of it. They developed another open source project that added crawling features (among others) to Lucene: Nutch, a full-featured web search engine that anyone can use or modify. Inspired by some famous papers from Google about MapReduce and the Google File System, features to distribute the index were added to Nutch and, eventually, those features became a project of their own: Hadoop. Since then, many projects have been developed on top of Hadoop. We are witnessing a big explosion of open source code that was ignited by the Lucene spark.

All these projects are in some way related to content processing. For those of you interested in search and information retrieval, we are now going to talk about another project whose domain is outside the strict confines of search, but which can teach you some interesting things about content processing.

Recently I have been reading about a new library, Mahout, that gathers all those obscure and mysterious machine learning algorithms together in one place. A lot of modern web sites use machine learning techniques. The algorithms are rather old and well known, but they were popularized recently by their extensive use in social networks (Facebook knowing better than you who could be your best friend) and in Google (reading your mind and guessing what you may have wanted to type in the search box).

In simple words, we can say that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

For example if you wanted to make a program to recognize captchas you would have:

T: recognizing captchas

P: percentage of words correctly recognized

E: a database of captchas with their correct words

So it appears that all they were missing was computing power, and a lot of it. For example, in supervised machine learning (I learned that term a month ago, don't expect me to be very academic) you show the machine some examples (experience) of what you want and the machine extracts patterns from them. It is pretty much what a person does when learning: from the examples shown, he uses his past knowledge to infer some kind of pattern that will help him classify future events of the same kind. Computers are pretty dumb at learning, so you have to show them a lot of examples before they infer the pattern.

Classification, however, is only part of the picture. There are three big areas that Mahout targets (and there will be more in the future, because the project is relatively new): classification, clustering, and recommendation. Typical examples of these:

  • Classification: supervised learning. You give the machine a lot of instances of… things (documents, for example) along with their categories, and from those the machine learns to classify future instances into the known categories.
  • Clustering: similar to classification in the sense that it makes groups of things, but it is not supervised. You give the machine a lot of things, and the machine makes groups of similar items.
  • Recommendation: exactly what IMDb or Amazon does with the recommended movies or books at the bottom of the page. Based on what other users liked (measured through the star ratings they give), Amazon can infer that “other users that liked this book also liked these others”.

If you want a better definition of these three categories you can go to Grant Ingersoll’s blog.

Offline example

Now I'll show you a simple program that I think is a pretty obvious classification example. We are going to classify mail into spam and not-spam (which the people who research it call ham). The steps are the following:

  1. Get a ham/spam corpus from the web (already classified of course).
  2. Train a classifier with 80% of the corpus, and leave 20% for testing
  3. Create a simple web service that classifies spam online. You provide it a mail, and it will say “good” or “bad” (or “ham” or “spam”)

The corpus we are going to use is the one from SpamAssassin, an open source antispam filter from Apache. This is a tutorial, so we are not trying to classify very difficult mails; the goal is just to show how simply it can be done with Mahout (of course, the difficult stuff was programmed by the developers of the library). This corpus is a pretty easy one and the results will be very satisfying.

Mahout comes with some examples already prepared. One of them is the 20 newsgroups example, which classifies newsgroup messages into their categories. The example is documented in the Mahout wiki and, luckily for us, the newsgroup format is pretty much the same as our mails, so we are going to apply the same processing chain to them. We will use a classification algorithm called naive Bayes, which uses the famous Bayes theorem and which I already mentioned in a previous post. I'm not going to explain how the algorithm works; I'll just show you that it works!

Mahout has two driver programs (they are called that because they are also used to run as MapReduce jobs on Hadoop): one for training a classifier and the other for testing it. When you train the classifier you provide it with a file (yes, a single file) that contains one document per line, already analyzed — “analyzed” in the same sense that Lucene analyzes documents. In fact, we are going to use the Lucene StandardAnalyzer to clean up the documents a little and transform them into a stream of terms. That stream is put on a line of the training file, where the first term is the category the item belongs to. For example, the training file will look like this:

ham new mahout version released
spam buy viagra now special discount

A small program comes with Mahout to turn directories of documents into this format. The directory must have an internal subdirectory for each category. In our case we are going to split the corpus into two directories, one for training and one for testing (both under <mahout_home>/examples/bin/work/spam, where <mahout_home> is where you unzipped the Mahout distribution). In each of them we put a ham directory and a spam directory:

test/
    ham
    spam

train/
    ham
    spam

We manually take about 80% of the ham and put it in train/ham, the rest in test/ham, and we do the same with the spam in train/spam and test/spam (it has never been easier to prepare a test set!!!). A small script, sketched below, can do the split for us.
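This is only a sketch, assuming the raw corpus sits in corpus/ham and corpus/spam (the directory names are an assumption):

import os, random, shutil

SRC = "corpus"                      # assumed layout: corpus/ham and corpus/spam
DST = "examples/bin/work/spam"      # the work directory used by the commands below

random.seed(0)
for label in ("ham", "spam"):
    files = os.listdir(os.path.join(SRC, label))
    random.shuffle(files)
    cut = int(len(files) * 0.8)     # 80% for training, 20% for testing
    for subset, names in (("train", files[:cut]), ("test", files[cut:])):
        target = os.path.join(DST, subset, label)
        if not os.path.exists(target):
            os.makedirs(target)
        for name in names:
            shutil.copy(os.path.join(SRC, label, name), target)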
Next, we are going to prepare the train and test files with the following commands

bin/mahout prepare20newsgroups -p examples/bin/work/spam/train -o examples/bin/work/spam/prepared-train -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
bin/mahout prepare20newsgroups -p examples/bin/work/spam/test -o examples/bin/work/spam/prepared-test -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

DefaultAnalyzer is a Lucene analyzer (actually, a Lucene analyzer wrapped within a Mahout class).

Now we are going to train the classifier. Training means feeding Mahout the train file and letting it build internal structures with the data (yes, as you can deduce from my use of the word “internal”, I have no idea how those structures work).

bin/mahout trainclassifier -i examples/bin/work/spam/prepared-train -o examples/bin/work/spam/bayes-model -type bayes -ng 1 -source hdfs

The model is created in the bayes-model directory, the algorithm is Bayes (naïve Bayes), the source is the Hadoop Distributed File System (we are not really using it, but that is what you tell the command when you are not using a distributed database like HBase), and -ng is the ngram size to use. Ngrams are groups of words; with larger ngrams you add more context (surrounding words) to each word, and the results should improve. We are using 1 because the better results obviously cost more processing time.

Now we run the tests with the following command

 bin/mahout testclassifier -m examples/bin/work/spam/bayes-model -d examples/bin/work/spam/prepared-test -type bayes -ng 1 -source hdfs -method sequential

And after a while we get the following results

-------------------------------------------------------
 Correctly Classified Instances : 383 95,75%
 Incorrectly Classified Instances : 17 4,25%
 Total Classified Instances : 400
=======================================================
 Confusion Matrix
 -------------------------------------------------------
 a b <--Classified as
 189 11 | 200 a = spam
 6 194 | 200 b = ham

Very good results!!!

A server to classify spam in real time

But we haven't done anything different from the 20 newsgroups example yet! What can we do if we want to classify mails as they arrive? We are going to create an antispam server: the mail server will send it every mail it receives, and our server will respond whether it is ham or spam (applying this procedure).

The server will be as simple as possible (this is just a proof of concept):

public class Antispam extends HttpServlet {

    private SpamClassifier sc;

    public void init() {
        try {
            sc = new SpamClassifier();
            sc.init(new File("bayes-model"));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (InvalidDatastoreException e) {
            e.printStackTrace();
        }
    }

    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {

        Reader reader = req.getReader();
        try {
            long t0 = System.currentTimeMillis();
            String category = sc.classify(reader);
            long t1 = System.currentTimeMillis();

            resp.getWriter().print(
                String.format("{\"category\":\"%s\", \"time\": %d}", category, t1 - t0));
        } catch (InvalidDatastoreException e) {
            e.printStackTrace();
        }
    }
}

Since this is a very simple example, we use a plain servlet. The important class is SpamClassifier:


public class SpamClassifier {

    private ClassifierContext context;
    private Algorithm algorithm;
    private Datastore datastore;
    private File modelDirectory;
    Analyzer analyzer;

    public SpamClassifier() {
        analyzer = new DefaultAnalyzer();
    }

    public void init(File basePath) throws FileNotFoundException, InvalidDatastoreException {

        if (!basePath.isDirectory() || !basePath.canRead()) {
            throw new FileNotFoundException(basePath.toString());
        }

        modelDirectory = basePath;

        algorithm = new BayesAlgorithm();
        BayesParameters p = new BayesParameters();
        p.set("basePath", modelDirectory.getAbsolutePath());
        p.setGramSize(1);
        datastore = new InMemoryBayesDatastore(p);
        context = new ClassifierContext(algorithm, datastore);
        context.initialize();
    }

    public String classify(Reader mail) throws IOException, InvalidDatastoreException {
        String document[] = BayesFileFormatter.readerToDocument(analyzer, mail);
        ClassifierResult result = context.classifyDocument(document, "unknown");

        return result.getLabel();
    }
}

You have a datastore and an algorithm. The datastore represents the model that you previously created by training the classifier. We are using an InMemoryBayesDatastore (there is also an HBaseBayesDatastore that uses the Hadoop database), providing it the base path and the ngram size. We use ngrams of size 1 to simplify the example; otherwise it would be necessary to post-process the analyzed text to construct the ngrams. The algorithm is the core of the method and an obvious instance of the Strategy design pattern. We are using BayesAlgorithm, but we could have used CBayesAlgorithm, which implements the Complementary Naive Bayes algorithm. ClassifierContext is the interface you use to classify documents.

We can test our server using curl:

curl http://localhost:8080/antispam -H "Content-Type: text/xml" --data-binary @ham.txt

and we get

{"category":"ham", "time": 10}

Conclusions

As we have seen, the spam filtering process can be separated into two parts: an offline process, where you take a lot of mails already classified by someone and train the classifier, and an online process, where you classify a document using the previously created model. The model can evolve: you can add more documents with more information and, after rerunning the offline processing, update the online server with the new model. The model can be very big, and this is where Hadoop enters the scene. The offline process can be sent to a cluster running Hadoop and, using the same libraries (Mahout!), you run what looks like exactly the same algorithms and get the results faster. Of course the algorithm is not literally the same, because it is executed in parallel by the thousands of computers that you surely have in your cluster (or the two or three PCs you have). Mahout was designed with this in mind: most of its algorithms were tailored to work over Hadoop. The interesting thing is that they can also work without it, for testing purposes or when you must embed the algorithms in a server that does not need distributed computation, like we did in this post. The combination of being usable over a cluster and embeddable in an application makes Mahout a powerful library for modern applications that work with data at web scale.


Creating a spellchecker with Solr

January 31, 2011

Introduction

In a previous post I talked about how the Solr spellchecker works and then showed you some test results of its performance. Now we are going to see another approach to spellchecking.

This method, like many others, uses a two-step procedure: a rather fast “candidate word” selection, and then a scoring of those words. We are going to choose different methods from the ones Solr uses and test their performance. Our main objective is effectiveness of the correction and, secondarily, speed; we can tolerate slightly slower performance if we gain correctness in the results.

Our strategy will be to use a special Lucene index and query it using fuzzy queries to get a candidate list. Then we are going to rank the candidates with a Python script (which can easily be transformed into a Solr spellchecker subclass if we get better results).

Candidate selection

Fuzzy queries have historically been considered slow compared to other query types but, since they were optimized in version 1.4, they are a good choice for the first part of our algorithm. The idea is very simple: we construct a Lucene index where every document is a dictionary word. When we have to correct a misspelled word, we issue a simple fuzzy query for that word and get a list of results: words similar to the one we provided (i.e., with a small edit distance). I found that with approximately 70 candidates we get excellent results.

With fuzzy queries we cover the typos because, as I said in the previous post, most typos are at edit distance 1 from the correct word. But although this is the most common error people make while typing, there are other kinds of errors.

We can find three types of misspellings [Kukich]:

  1. Typographic errors
  2. Cognitive errors
  3. Phonetic errors

Typographic errors are the typos: the person knows the correct spelling but makes a motor coordination slip while typing. Cognitive errors are caused by a lack of knowledge. Finally, phonetic errors are a special case of cognitive errors: words that sound right but are spelled incorrectly. We already cover typographic errors with the fuzzy query, but we can also do something about the phonetic errors. Solr has a phonetic filter in its analysis package that implements, among others, the double metaphone algorithm. In the same way we use a fuzzy query to find similar words, we can index the metaphone equivalent of each word and run a fuzzy query against it. We must obtain the metaphone equivalent of the word manually (because the Lucene query parser doesn't analyze fuzzy queries) and construct the fuzzy query with it.

In a few words, for candidate selection we construct an index with the following Solr schema:

<fieldType name="spellcheck_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" maxCodeLength="20" inject="false"/>
     </analyzer>
    </fieldType>

   <field name="original_word" type="string" indexed="true" stored="true" multiValued="false"/>
   <field name="analyzed_word" type="spellcheck_text" indexed="true" stored="true" multiValued="false"/>
   <field name="freq" type="tfloat" stored="true" multiValued="false"/>

As you can see, the analyzed_word field contains the “sounds like” form of the word. The freq field will be used in the next phase of the algorithm; it is simply the frequency of the term in the language. How can we estimate the frequency of a word in a language? By counting how often it appears in a big text corpus. In this case the source of the terms is the Wikipedia, and we use Solr's TermsComponent to count how many times each term appears in it.

But Wikipedia is written by ordinary people who make mistakes! How can we trust it as a “correct dictionary”? We make use of the “collective knowledge” of the people who write it. This dictionary of terms extracted from Wikipedia has a lot of terms — over 1,800,000 — and most of them aren't even words. It is likely, though, that words with a high frequency are spelled correctly. This approach of building a dictionary from a big corpus and considering the most frequent words correct isn't new; in [Cucerzan] they use the same concept but build the dictionary from query logs. It appears that Google's “Did you mean” uses a similar idea.

We can add small optimizations here: I found that we can remove some words and still get good results. For example, I removed words with frequency 1 and words that begin with numbers. We could keep removing words based on other criteria, but we'll leave it at that.

So the procedure for building the index is simple: we extract all the terms from the Wikipedia index, along with their frequencies, via Solr's TermsComponent, and then create a new index in Solr using SolrJ.
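I did that part with SolrJ, but an equivalent sketch in Python (to keep everything in one language here) would look roughly like this. The core names, the field name “text” and the use of the JSON update handler are assumptions about the setup, so treat it as an outline rather than the code I actually ran.

from urllib import urlopen
import urllib2, json, math

WIKI = "http://benchmarks:8983/solr/wikipedia"       # assumed source core
SPELL = "http://benchmarks:8983/solr/testSpellMeta"  # the dictionary core used below

# 1. Pull every term of the (assumed) "text" field with its frequency;
#    terms.mincount=2 already drops the frequency-1 words.
url = WIKI + "/terms?terms.fl=text&terms.limit=-1&terms.mincount=2&wt=json&json.nl=map"
terms = json.loads(urlopen(url).read())["terms"]["text"]

# 2. Build one document per surviving term. The ranking script further down
#    reads the stored freq as a logarithm, so the log is stored precomputed.
docs = []
for term, count in terms.items():
    if term[0].isdigit():                 # drop terms that begin with a number
        continue
    docs.append({"original_word": term,
                 "analyzed_word": term,   # the phonetic filter runs at index time
                 "freq": math.log(count)})

# 3. Send everything to the dictionary core and commit.
req = urllib2.Request(SPELL + "/update/json?commit=true", json.dumps(docs),
                      {"Content-Type": "application/json"})
print urllib2.urlopen(req).read()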

Candidate ranking

Now, the ranking of the candidates. For the second phase of the algorithm we make use of information theory, in particular the noisy channel model. Applied to this case, the noisy channel assumes that the human knows the correct spelling of a word but some noise in the channel introduces an error, and as a result we get another, misspelled word. We intuitively know that it is very unlikely to get ‘sarasa’ when trying to type ‘house’, and the noisy channel model adds some formality to determining how probable a given error is.
For example, we have misspelled ‘houze’ and want to know which word we most likely meant to type. We have a big dictionary of possible words, but not all of them are equally probable; we want the word with the highest probability of having been the intended one. In mathematics this is conditional probability: given that we typed ‘houze’, how probable is each of the correct words? The notation is P(‘house’|’houze’), the probability of ‘house’ given ‘houze’.

This problem can be seen from two perspectives. We may think that the most common words are more probable; for example, ‘house’ is more probable than ‘hose’ because the former is a more common word. On the other hand, we also intuitively think that ‘house’ is more probable than ‘photosynthesis’ because of the big difference between that word and the misspelling. Both of these aspects are formally captured by Bayes' theorem:

P(house|houze) = \frac{P(houze|house) P(house)}{P(houze)}

We have to maximize this probability, and the only parameter we can vary is the candidate word (‘house’ in the case shown).

Since the probability of the misspelled word, P(houze), is the same for every candidate, we are not interested in it. The formula reduces to

Max(P(house|houze)) = Max(P(houze|house) P(house))

To add more structure, scientists have given names to these two factors. The P(‘houze’|’house’) factor is the error model (or channel model): how probable it is that the channel introduces this particular misspelling when trying to write ‘house’. The second factor, P(‘house’), is called the language model and tells us how common a word is in the language.

Up to this point I have only introduced the mathematical side of the model; now we have to come up with concrete estimates of these two probabilities. For the language model we can use the frequency of the term in the text corpus. I found empirically that the logarithm of the frequency works much better than the frequency alone, perhaps because we want to dampen the weight of the very frequent terms more than that of the less frequent ones, and the logarithm does just that.

There is more than one way to construct a channel model; many different ideas have been proposed. We are going to use a simple one based on the Damerau-Levenshtein distance. I also found that the fuzzy query of the first phase does a good job of finding candidates — with some datasets it puts the correct word in first place in more than half of the test cases — so the channel model will combine the Damerau-Levenshtein distance with the score that Lucene assigned to the terms of the fuzzy query.

The ranking formula will be:

Score = \frac{Levenshtein}{log(freq) Fuzzy}

I programmed a small Python script that does everything described above:

from urllib import urlopen
import doubleMethaphone
import levenshtain
import json

server = "http://benchmarks:8983/solr/testSpellMeta/"

def spellWord(word, candidateNum = 70):
    #fuzzy + soundlike
    metaphone = doubleMethaphone.dm(word)
    query = "original_word:%s~ OR analyzed_word:%s~" % (word, metaphone[0])

    if metaphone[1] != None:
        query = query + " OR analyzed_word:%s~" % metaphone[1]

    doc = urlopen(server + "select?rows=%d&wt=json&fl=*,score&omitHeader=true&q=%s" % (candidateNum, query)).read( )
    response = json.loads(doc)
    suggestions = response['response']['docs']

    if len(suggestions) > 0:
        #score
        scores = [(sug['original_word'], scoreWord(sug, word)) for sug in suggestions]
        scores.sort(key=lambda candidate: candidate[1])
        return scores
    else:
        return []

def scoreWord(suggestion, misspelled):
    distance = float(levenshtain.dameraulevenshtein(suggestion['original_word'], misspelled))
    if distance == 0:
        distance = 1000
    fuzzy = suggestion['score']
    logFreq = suggestion['freq']

    return distance/(fuzzy*logFreq)

Some remarks about the previous listing. In lines 2 and 3 we use third-party libraries for the double metaphone and Damerau-Levenshtein algorithms. In line 8 we collect a list of 70 candidates; this number was found empirically — with more candidates the algorithm is slower, with fewer it is less effective. We also effectively exclude the misspelled word itself from the candidate list in line 30: since we used Wikipedia as our source, the misspelled word is often found in the dictionary, so if the Levenshtein distance is 0 (same word) we set its distance to 1000.
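Using it is straightforward; the suggestions come back sorted with the best ranked (lowest score) first:

suggestions = spellWord("houze")
for word, score in suggestions[:3]:
    print word, score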

Tests

I ran some tests with this algorithm. The first one uses the dataset that Peter Norvig used in his article. I found the correct suggestion in the first position approximately 80% of the time!!! That is a really good result. Norvig, with the same dataset (but a different algorithm and training set), got 67%.

Now let's repeat some of the tests from the previous post to see the improvement. The following table shows the results.

Test set            % Solr    % new     Solr time [seconds]   New time [seconds]   Improvement   Time loss
FAWTHROP1DAT.643    45,61%    81,91%    31,50                 74,19                79,58%        135,55%
batch0.tab          28,70%    56,34%    21,95                 47,05                96,30%        114,34%
SHEFFIELDDAT.643    60,42%    86,24%    19,29                 35,12                42,75%        82,06%

We can see that we get very good improvements in effectiveness of the correction but it takes about twice the time.

Future work

How can we improve this spellchecker? Studying the candidate lists, we find that the correct word is generally (95% of the time) contained in them, so our efforts should be aimed at improving the scoring algorithm.

We have many ways of improving the channel model; several papers show that calculating more sophisticated distances, weighting the different letter transformations according to language statistics, can give us a better measure. For example, we know that typing ‘houpe’ is less probable than typing ‘houze’.

For the language model, great improvements can be obtained by adding more context to the word. For example, if we misspell ‘nouse’ it is very difficult to tell whether the correct word is ‘house’ or ‘mouse’; but with more words, “paint my nouse”, it is evident that the word we were looking for was ‘house’ (unless you have strange habits involving rodents). These are also called ngrams (of words in this case, instead of letters). Google has published a big collection of ngrams, available for download with their frequencies.

Last but not least, performance can be improved by porting the script to Java, since part of the algorithm was written in Python.

Bye!

As an update for those interested: Robert Muir told me on the Solr user list that there is a new spellchecker, DirectSpellChecker, which was in trunk at the time and should now be part of Solr 3.1. It uses a technique similar to the one presented in this entry, without the performance penalty.

References

[Kukich] Karen Kukich – Techniques for automatically correcting words in text – ACM Computing Surveys – Volume 24 Issue 4, Dec. 1992

[Cucerzan] S. Cucerzan and E. Brill Spelling correction as an iterative process that exploits the collective knowledge of web users. July 2004

Peter Norvig – How to Write a Spelling Corrector

Solr Spellchecker internals (now with tests!)

January 18, 2011

Let's talk about spellcheckers. A spellchecker, as you may know, is that device that tells you whether or not you misspelled a word and offers you some suggestions. One of the first spellcheckers I remember seeing is the one in MS Word. One could say that MS Word defined the “standard interface” of word processor spellcheckers: you misspell a word, it gets underlined with a red zig-zag line, and if you right-click on it a list of suggested words appears. I have seen this interface in many different programs, for example Google Docs.

Another, more modern example is Google's famous “Did you mean”. You type something like “britny spears” and Google suggests “Britney Spears”. It appears that a lot of people have trouble spelling Britney Spears. But Google is different: as usual, they use artificial intelligence algorithms to suggest corrections for misspelled words. Google's algorithms are the closest you'll get to magic in computer engineering.
Today, though, I'm going to talk about the Solr SpellChecker. In contrast with Google, the Solr spellchecker isn't much more than a pattern similarity algorithm: you give it a word and it finds similar words. But what does Solr interpret as “similar”? Words are treated just as arrays of characters, so two words are similar if their character sequences have many coincidences. That may sound obvious, but in natural languages the bytes (letters) have little meaning; it is the entire word that carries meaning. Solr's algorithms don't even know that you are giving them words. Those byte sequences could be sequences of numbers, or sequences of colors, and Solr would find the sequences of numbers with small differences from the input, or the sequences of colors, etc. By the way, this is not the approach Google follows: Google knows the frequent words, the frequent misspellings, and the frequent ways humans make mistakes. I intend to talk about those interesting topics in a future post, but for now let's study how the Solr spellchecker works in detail, and then run some tests.

The Solr spellchecker follows the same strategy as many others. It has a dictionary of correctly spelled terms (correct by definition — if there is a misspelled word in the dictionary, it will pass as correct). When somebody asks for suggestions for a word, Solr first obtains a list of candidate words and then ranks those candidates according to some criteria. The first step is accomplished with ngrams. An ngram is a substring of a string; for example, if you have the word ‘house’, some ngrams are ‘hou’, ‘ous’ and ‘se’ (there are many other ngrams of different lengths, I have only shown three of them). Two similar words will have many matching ngrams: ‘mouse’ also has ‘ous’ and ‘se’, but not ‘hou’. What Solr does is create a Lucene index of the words in the dictionary, filtered with ngram filters. So when you ask for suggestions for “house”, Solr searches ‘ho’ OR ‘ou’ OR ‘us’ OR ‘se’ OR ‘hou’ OR ‘ous’ OR ‘use’ OR ‘hous’ OR ‘ouse’ and, because Solr ranks boolean queries higher for documents with more matches, what you get is a list of similar words from the dictionary.
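To see why ngram overlap surfaces similar words, here is a toy sketch of the idea in Python. It only mimics the boolean-query intuition (counting shared ngrams); it is not how Lucene actually scores.

def ngrams(word, sizes=(2, 3, 4)):
    # all substrings of the given lengths, e.g. 'hou', 'ous', 'use' for size 3
    return set(word[i:i + n] for n in sizes for i in range(len(word) - n + 1))

dictionary = ["house", "mouse", "horse", "photosynthesis"]
query = ngrams("houze")

# rank dictionary words by how many ngrams they share with the misspelled word
candidates = sorted(dictionary, key=lambda w: -len(ngrams(w) & query))
print candidates        # 'house' shares the most ngrams and comes first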

How does Solr rank the words afterwards? There is something called edit distance that tells us how many operations we have to perform to transform one word into another — insertions, deletions, or substitutions of single characters. There are several algorithms to compute the edit distance; one is Levenshtein, which is the default in Solr. These algorithms are computationally expensive, and that is why Solr does not use them as the first step for selecting suggestions from among all the words in the dictionary: the dictionary is first reduced, and only then is this “difficult” ranking performed.
Perhaps now you understand what I meant by “the Solr spellchecker only finds similar byte arrays”: you never introduce information about our natural language into the algorithm; the only thing you provide is “a set of byte sequences” (i.e., a dictionary).
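And here is the classic dynamic programming formulation of that edit distance, again just as an illustration of the concept rather than Lucene's implementation:

def levenshtein(a, b):
    # previous[j] holds the edit distance between the processed prefix of a and b[:j]
    previous = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print levenshtein("house", "houze")   # 1: a single substitution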

So far, so good. Does this approach work? Yes. Could it work better? Of course! There are a lot of things we can do to improve the algorithm. But first, let's try to make this look scientific (if you remember, that was the idea of the Internet in the first place…): we need tests to see where we stand. Something I find boring is moving from the theoretical side to the experimental side, but that is a must in this thing we call research. So next I present a series of tests that I performed on a Solr instance of the Wikipedia that we keep for experimental purposes (I recommend reading this post about how we indexed Wikipedia, for all of you trying to index huge amounts of text).
I created a dictionary using the words from Wikipedia and then tested it with lots of misspelled words taken from different sources.

For each test case I created a small Python script that simply queries every misspelled word against Solr and records in which position the correctly spelled word is returned (each test case includes the expected correct word). You can download the source from here.

The first set is a synthetic list of misspelled words that I created from a dictionary taken from the Internet (an Ispell dictionary) by applying “edit distance 1” modifications to the words. I used part of the algorithm by Peter Norvig, from his excellent article on spellcheckers.

Total Word processed: 19708
Found in the first 1 positions: 53%
Found in the first 2 positions: 58%
Found in the first 3 positions: 60%
Found in the first 10 positions: 61%

That means that 53% of the words were properly corrected in the first suggestion, 58% in the first two, and so on. Pretty awful results even with an easy dataset. But let’s try something more difficult.

Aspell is the GNU spellchecker library. They provide a dataset that they use to test it, with very good results, but they use a different method.
I tried that dataset against our test environment and this is the result:

Total Word processed: 547
Found in the first 1 positions: 27%
Found in the first 2 positions: 33%
Found in the first 3 positions: 37%
Found in the first 10 positions: 45%

Even worse. They do not specify the origin of these words. A good test would use real mistakes made by real humans, so the next sets were taken from a freely available file: a list of common misspellings by college students, a list of typos, and a list of known spelling errors in the English language (all of this can be seen in the Readme file that comes with the download; I don't want to go on about it). The format of some of these files needed to be converted, and I included the conversion scripts in the code download above.

MASTERS: misspellings of about 260 words made in spelling tests by 600 students in Iowa in the 1920s – 200 8th graders, 200 high-school seniors and 200 college seniors.

Total Word processed: 13020
Found in the first 1 positions: 27 %
Found in the first 2 positions: 35 %
Found in the first 3 positions: 38 %
Found in the first 10 positions: 44 %

SHEFFIELD: a list of about 380 misspellings, mostly keying errors, taken from typewritten or computer-terminal input, collected from staff and students in the Department of Information Studies of Sheffield University by Angell, Freund and Willett as part of a piece of research into spelling correction.

Total Word processed: 384
Found in the first 1 positions: 57 %
Found in the first 2 positions: 62 %
Found in the first 3 positions: 65 %
Found in the first 10 positions: 70 %

FAWTHROP: a compilation of four collections of American spelling errors, already in published form.

Total Word processed: 809
Found in the first 1 positions: 44 %
Found in the first 2 positions: 50 %
Found in the first 3 positions: 52 %
Found in the first 9 positions: 55 %

The best result, very similar to my first test, is the one for typos. That's because typos are generally words at edit distance 1 from real words (you don't usually make two typos in the same word). The others are pretty bad. This scenario is exactly what anyone indexing a big document corpus (like Wikipedia) and building a dictionary from it would face, and this is the effectiveness they would get from the spellchecker.

What could be the reason for these results? Probably that the Solr spellchecker doesn't know anything about the natural language we call English.
How can you improve them? With a different approach to spellchecking. There are algorithms that make suggestions by finding words that sound similar; how a word sounds adds a lot of information about the natural language (and the psychology of making mistakes) to the system (this is the approach followed by GNU Aspell). Other algorithms use the information theory that Shannon created and consider that we humans are noisy channels that introduce noise (spelling mistakes) into words. If you introduce natural language information via statistics, you can get better results (as in the article by Peter Norvig).
In future posts we'll implement some of these algorithms and (following the scientific tradition) we'll test them.
I'll see you in the future!!!

Hello world!

December 27, 2010

Welcome to my first blog post! I've had the idea of writing a blog for a very long time, and for some reason I never started working on it. Probably because I thought that nobody would be interested in what I had to write. Or, more likely, because I'm lazy.

Anyway, at my new job (which may be the subject of another post) they give a lot of importance to research and development, and they encourage us to keep a blog and write about the things we come up with. That is just the excuse somebody lazy like me needs to start a new project: in this case, my long anticipated blog.

This blog will be mainly about geek stuff, written by a geek and aimed at geeks. The software we use is Solr, a big search application based on Lucene, which is an inverted index. Oddly, Solr doesn't have as much bibliography as other Apache projects; I found only one book, by Packt Publishing, and nothing else. Of course there is a lot of documentation on the website, but that's not the same. So part of this blog will be about the research side of my work: creating benchmarks, posting the results and all that. And also about topics that have nothing to do with Solr but that I always wanted to talk about, just because I find them interesting.

Finally, English is not my native language, but as we all know, Esperanto didn't have the success its creator expected and English was adopted as the universal language (natural selection, perhaps). So I'll be writing all my posts in English, and that's why you'll find some strange grammar and vocabulary in my blog!!!

I hope you all like this blog-project!

