Turkish I Problem on RavenDB and Solving It with Custom Lucene Analyzers

Yesterday, I ran into the Turkish I problem on RavenDB, and here is how I solved it with a custom Lucene analyzer.

16 July 2013

5 minutes read

RavenDB, by default, uses a custom Lucene.Net analyzer named LowerCaseKeywordAnalyzer, which makes all your queries case-insensitive. That is what I would expect. For example, the following query finds the User whose Name property is set to "TuGbErK":

using System.Collections.Generic;
using System.Linq;
using Raven.Client;
using Raven.Client.Document;
using Raven.Client.Extensions;

class Program
{
    static void Main(string[] args)
    {
        const string DefaultDatabase = "EqualsTryOut";
        IDocumentStore store = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = DefaultDatabase
        }.Initialize();

        store.DatabaseCommands.EnsureDatabaseExists(DefaultDatabase);

        using (var ses = store.OpenSession())
        {
            var user = new User { 
                Name = "TuGbErK", 
                Roles = new List<string> { "adMin", "GuEst" } 
            };

            ses.Store(user);
            ses.SaveChanges();

            //this finds name:TuGbErK
            var user1 = ses.Query<User>()
                .Where(usr => usr.Name == "tugberk")
                .FirstOrDefault();
        }
    }
}

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public ICollection<string> Roles { get; set; }
}

The problem appears when you need to store Turkish text. You may ask why, which is a fair question. It comes down to the well-known Turkish "I" problem. Let's try to reproduce it with an example on RavenDB.

class Program
{
    static void Main(string[] args)
    {
        const string DefaultDatabase = "EqualsTryOut";
        IDocumentStore store = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = DefaultDatabase
        }.Initialize();

        store.DatabaseCommands.EnsureDatabaseExists(DefaultDatabase);

        using (var ses = store.OpenSession())
        {
            var user = new User { 
                Name = "Irmak", 
                Roles = new List<string> { "adMin", "GuEst" } 
            };
            
            ses.Store(user);
            ses.SaveChanges();

            //This fails due to the Turkish I
            var user1 = ses.Query<User>()
                .Where(usr => usr.Name == "ırmak")
                .FirstOrDefault();

            //this finds name:Irmak
            var user2 = ses.Query<User>()
                .Where(usr => usr.Name == "IrMak")
                .FirstOrDefault();
        }
    }
}

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public ICollection<string> Roles { get; set; }
}

Here, we have the same code, but the Name value we are storing is different: Irmak. Irmak is a Turkish name which also means "river" in English (which is totally not the point here), and it starts with the Turkish letter "I". The lowercase version of this letter is "ı", and this is where the problem arises: if you lowercase this character in an invariant culture manner, it is transformed into "i", not "ı". This is what RavenDB does with its LowerCaseKeywordAnalyzer, so the value is indexed as "irmak". That's why the first query above, where we searched for "ırmak", finds nothing: "ı" is already lowercase and stays as it is, so the term never matches. The second query finds the User we are looking for because "IrMak" is lowercased into "irmak".
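To see the mismatch in isolation, without RavenDB or Lucene.Net involved, here is a minimal sketch that only mimics the invariant lowercasing step. The class and method names (InvariantLowercaseDemo, Normalize, Matches) are made up for this illustration and are not part of RavenDB.

```csharp
using System;

public static class InvariantLowercaseDemo
{
    // Mimics what an invariant-culture lowercasing analyzer does
    // to a whole field value (illustrative only, not RavenDB code).
    public static string Normalize(string value)
    {
        return value.ToLowerInvariant();
    }

    // True when the stored value and the query term end up as the
    // same term after normalization (ordinal string comparison).
    public static bool Matches(string stored, string query)
    {
        return Normalize(stored) == Normalize(query);
    }
}
```

With this sketch, Matches("Irmak", "ırmak") returns false because the stored value becomes "irmak" while the dotless "ı" in the query is left untouched, and Matches("Irmak", "IrMak") returns true.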

The Solution with a Custom Analyzer

The default analyzer that RavenDB uses is the LowerCaseKeywordAnalyzer, and it uses the LowerCaseKeywordTokenizer as its tokenizer. Inside that tokenizer, you will see the Normalize method, which lowercases a character in an invariant culture manner; this is what causes the problem in our case. AFAIK, there is no built-in Lucene.Net tokenizer which handles Turkish characters well (I might be wrong here). So, I decided to modify the LowerCaseKeywordTokenizer according to my specific needs. It was a very naive and minor change which worked, but I'm not sure if I handled it well. You can find the source code of the TurkishLowerCaseKeywordTokenizer and TurkishLowerCaseKeywordAnalyzer classes on my GitHub repository.
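The idea of a Turkish-aware Normalize step can be sketched roughly as follows. This is not the code from the repository and it does not derive from any Lucene.Net type; it is a standalone illustration that assumes the tr-TR culture data is available on the machine. TurkishNormalizer and NormalizeTerm are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class TurkishNormalizer
{
    // Assumption: tr-TR culture data is available at runtime
    // (it is not when globalization-invariant mode is enabled).
    private static readonly CultureInfo Turkish = new CultureInfo("tr-TR");

    // Per-character hook, analogous to a tokenizer's Normalize(char) step,
    // but lowercasing with Turkish casing rules instead of invariant ones.
    public static char Normalize(char c)
    {
        return char.ToLower(c, Turkish);
    }

    // Applies the per-character hook to a whole value, the way a keyword
    // tokenizer treats the entire input as a single term.
    public static string NormalizeTerm(string value)
    {
        var buffer = new char[value.Length];
        for (var i = 0; i < value.Length; i++)
        {
            buffer[i] = Normalize(value[i]);
        }
        return new string(buffer);
    }
}
```

Under that assumption, both "Irmak" and "ırmak" normalize to "ırmak", so the stored value and the query term line up. The trade-off is that a plain ASCII "irmak" no longer matches, since "i" and "ı" stay distinct letters under Turkish rules.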

Using a Custom-Built Analyzer on RavenDB

RavenDB allows us to use custom analyzers and control the analyzer per field. If you're going to use a built-in Lucene analyzer for a field, you need to pass the FullName of the analyzer type, just like in the example below, which is a straight copy and paste from the RavenDB documentation.

store.DatabaseCommands.PutIndex(
    "Movies",
    new IndexDefinition
        {
            Map = "from movie in docs.Movies select new { movie.Name, movie.Tagline }",
            Analyzers =
                {
                    { "Name", typeof(SimpleAnalyzer).FullName },
                    { "Tagline", typeof(StopAnalyzer).FullName },
                }
        });

On the other hand, RavenDB also allows us to drop our own custom analyzers in:

"You can also create your own custom analyzer, compile it to a dll and drop it in in directory called "Analyzers" under the RavenDB base directory. Afterward, you can then use the fully qualified type name of your custom analyzer as the analyzer for a particular field."

There are a couple of things that you need to be careful about when going down this road:

You need to use the Lucene.Net assembly that your RavenDB server is using because RavenDB is using a custom build.
Drop your compiled assembly into the directory called "Analyzers" under the RavenDB base directory.
When you are configuring the specific fields to use your analyzer, be sure to pass the AssemblyQualifiedName of your custom analyzer class.
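The last point is easy to get wrong, so here is a small sketch of how FullName and AssemblyQualifiedName differ. The empty class below is only a hypothetical stand-in that reuses the analyzer's name; the real one lives in its own assembly and derives from a Lucene.Net analyzer type.

```csharp
using System;

namespace LuceneAnalyzers
{
    // Hypothetical stand-in so that the sketch compiles on its own.
    public class TurkishLowerCaseKeywordAnalyzer { }
}

public static class AnalyzerNames
{
    // Namespace plus type name only.
    public static string ShortName()
    {
        return typeof(LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer).FullName;
    }

    // FullName followed by ", " and the full name of the containing
    // assembly (assembly name, version, culture, public key token).
    public static string QualifiedName()
    {
        return typeof(LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer).AssemblyQualifiedName;
    }
}
```

ShortName() gives "LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer", while QualifiedName() starts with the same text and then adds the assembly details, which is presumably what lets the server locate a type that does not live in the Lucene.Net assembly itself.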

After I stopped my RavenDB server, I dropped my assembly, which contains the TurkishLowerCaseKeywordAnalyzer, into the "Analyzers" folder under the RavenDB base directory. On the client side, here is my code, which consists of the index creation and the query:

class Program
{
    static void Main(string[] args)
    {
        const string DefaultDatabase = "EqualsTryOut";
        IDocumentStore store = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = DefaultDatabase
        }.Initialize();

        // Make sure the database exists before creating indexes on it.
        store.DatabaseCommands.EnsureDatabaseExists(DefaultDatabase);
        IndexCreation.CreateIndexes(typeof(Users).Assembly, store);

        using (var ses = store.OpenSession())
        {
            var user = new User { 
                Name = "Irmak", 
                Roles = new List<string> { "adMin", "GuEst" } 
            };
            ses.Store(user);
            ses.SaveChanges();

            //this finds name:Irmak
            var user1 = ses.Query<User, Users>()
                .Where(usr => usr.Name == "ırmak")
                .FirstOrDefault();
        }
    }
}

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public ICollection<string> Roles { get; set; }
}

public class Users : AbstractIndexCreationTask<User>
{
    public Users()
    {
        Map = users => from user in users
                       select new 
                       {
                          user.Name 
                       };

        Analyzers.Add(
            x => x.Name, 
            typeof(LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer)
                .AssemblyQualifiedName);
    }
}

It worked like a charm. I'm hoping this helps you solve this annoying problem, and please post a comment if you know a better way of handling it.
