Turkish I Problem on RavenDB and Solving It with Custom Lucene Analyzers

Yesterday, I ran into a Turkish I problem on RavenDB, and here is how I solved it with a custom Lucene analyzer.
2013-07-16 14:37
Tugberk Ugurlu


RavenDB, by default, uses a custom Lucene.Net analyzer named LowerCaseKeywordAnalyzer, and it makes all your queries case-insensitive, which is what I would expect. For example, the following query finds the User whose Name property is set to "TuGbErK":

class Program
{
     static void Main(string[] args)
     {
          const string DefaultDatabase = "EqualsTryOut";
          IDocumentStore store = new DocumentStore
          {
               Url = "http://localhost:8080",
               DefaultDatabase = DefaultDatabase
          }.Initialize();
          store.DatabaseCommands.EnsureDatabaseExists(DefaultDatabase);
          using (var ses = store.OpenSession())
          {
               var user = new User { 
                  Name = "TuGbErK", 
                  Roles = new List<string> { "adMin", "GuEst" } 
               };
               
               ses.Store(user);
               ses.SaveChanges();
               //this finds name:TuGbErK
               var user1 = ses.Query<User>()
                  .Where(usr => usr.Name == "tugberk")
                  .FirstOrDefault();
          }
     }
}
public class User
{
     public string Id { get; set; }
     public string Name { get; set; }
     public ICollection<string> Roles { get; set; }
}

The problem starts to appear when you need to store Turkish text. You may ask why at this point, which makes sense. The problem is related to the well-known Turkish "I" problem. Let's try to reproduce this problem with an example on RavenDB.

class Program
{
    static void Main(string[] args)
    {
        const string DefaultDatabase = "EqualsTryOut";
        IDocumentStore store = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = DefaultDatabase
        }.Initialize();

        store.DatabaseCommands.EnsureDatabaseExists(DefaultDatabase);

        using (var ses = store.OpenSession())
        {
            var user = new User { 
                Name = "Irmak", 
                Roles = new List<string> { "adMin", "GuEst" } 
            };
            
            ses.Store(user);
            ses.SaveChanges();

            //This fails due to the Turkish I
            var user1 = ses.Query<User>()
                .Where(usr => usr.Name == "ırmak")
                .FirstOrDefault();

            //this finds name:Irmak
            var user2 = ses.Query<User>()
                .Where(usr => usr.Name == "IrMak")
                .FirstOrDefault();
        }
    }
}

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public ICollection<string> Roles { get; set; }
}

Here, we have the same code but the Name value we are storing is different: Irmak. Irmak is a Turkish name which also means "river" in English (which is totally not the point here), and it starts with the Turkish letter "I". The lowercase version of this letter is "ı", and this is where the problem arises: if you lowercase this character in a culture-invariant manner, it is transformed into "i", not "ı". This is what RavenDB does with its LowerCaseKeywordAnalyzer, and that's why we can't find anything with the first query above, where we searched against "ırmak". In the second query, we can find the User that we are looking for because "IrMak" is lowercased into "irmak".
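To make the mismatch concrete, here is a minimal sketch that uses only the .NET base class library (no RavenDB or Lucene.Net involved). The class and method names are made up for illustration, and it assumes the runtime has culture data for tr-TR available:

```csharp
using System;
using System.Globalization;

public static class TurkishIDemo
{
    // Assumption: the runtime ships tr-TR culture data (not invariant-globalization mode).
    private static readonly CultureInfo Turkish = CultureInfo.GetCultureInfo("tr-TR");

    // Culture-invariant lowercasing: 'I' becomes 'i'.
    public static string LowerInvariant(string s) => s.ToLowerInvariant();

    // Turkish lowercasing: 'I' becomes the dotless 'ı' (U+0131).
    public static string LowerTurkish(string s) => s.ToLower(Turkish);

    public static void Main()
    {
        const string dotlessQuery = "\u0131rmak"; // "ırmak"

        Console.WriteLine(LowerInvariant("Irmak") == "irmak");      // True
        Console.WriteLine(LowerInvariant("Irmak") == dotlessQuery); // False
        Console.WriteLine(LowerTurkish("Irmak") == dotlessQuery);   // True
    }
}
```

The second comparison is the failing query in miniature: an invariant-lowercased "Irmak" can never equal a term that contains the dotless "ı".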

The Solution with a Custom Analyzer

The default analyzer that RavenDB uses is the LowerCaseKeywordAnalyzer, and it uses the LowerCaseKeywordTokenizer as its tokenizer. Inside that tokenizer, you will see the Normalize method, which lowercases a character in a culture-invariant manner, and that is what causes the problem in our case. AFAIK, there is no built-in Lucene.Net tokenizer which handles Turkish characters well (I might be wrong here). So, I decided to modify the LowerCaseKeywordTokenizer according to my specific needs. It was a very naive and minor change which worked, but I'm not sure if I handled it well. You can find the source code of the TurkishLowerCaseKeywordTokenizer and TurkishLowerCaseKeywordAnalyzer classes in my GitHub repository.
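The core of such a change can be sketched without any Lucene.Net types. The snippet below is not the code from the repository; it is a hypothetical stand-in (the TurkishNormalizer class and its method names are assumptions) that shows a per-character Normalize step using the Turkish culture instead of the invariant one:

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, for illustration only; not part of RavenDB or Lucene.Net.
public static class TurkishNormalizer
{
    // Assumption: the runtime has tr-TR culture data available.
    private static readonly CultureInfo Turkish = CultureInfo.GetCultureInfo("tr-TR");

    // Lowercases a single character with Turkish casing rules:
    // 'I' -> 'ı' (U+0131) and 'İ' (U+0130) -> 'i'.
    public static char Normalize(char c) => char.ToLower(c, Turkish);

    // Applies the per-character step to a whole term, the way a
    // keyword-style tokenizer would treat the entire input as one token.
    public static string NormalizeTerm(string term) =>
        new string(term.Select(c => Normalize(c)).ToArray());
}

public static class NormalizerProgram
{
    public static void Main()
    {
        // The stored value and a dotless query now normalize to the same term.
        Console.WriteLine(
            TurkishNormalizer.NormalizeTerm("Irmak") ==
            TurkishNormalizer.NormalizeTerm("\u0131rmak")); // True
    }
}
```

Note that with Turkish rules "irmak" (dotted) and "ırmak" (dotless) stay distinct terms, which is correct for Turkish text but means a field analyzed this way no longer matches the dotted spelling.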

Using a Custom-Built Analyzer on RavenDB

RavenDB allows us to use custom analyzers and control the analyzer per field. If you're going to use a built-in Lucene analyzer for a field, you need to pass the FullName of the analyzer type, just like in the example below, which is a straight copy and paste from the RavenDB documentation.

store.DatabaseCommands.PutIndex(
    "Movies",
    new IndexDefinition
        {
            Map = "from movie in docs.Movies select new { movie.Name, movie.Tagline }",
            Analyzers =
                {
                    { "Name", typeof(SimpleAnalyzer).FullName },
                    { "Tagline", typeof(StopAnalyzer).FullName },
                }
        });

On the other hand, RavenDB also allows us to drop our own custom analyzers in:

"You can also create your own custom analyzer, compile it to a dll and drop it in in directory called "Analyzers" under the RavenDB base directory. Afterward, you can then use the fully qualified type name of your custom analyzer as the analyzer for a particular field."

There are a couple of things that you need to be careful of when going down this road:

After I stopped my RavenDB server, I dropped my assembly, which contains the TurkishLowerCaseKeywordAnalyzer, into the "Analyzers" folder under the RavenDB base directory. On the client side, here is my code, which consists of the index creation and the query:

class Program
{
    static void Main(string[] args)
    {
        const string DefaultDatabase = "EqualsTryOut";
        IDocumentStore store = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = DefaultDatabase
        }.Initialize();

        // Make sure the database exists before creating indexes in it.
        store.DatabaseCommands.EnsureDatabaseExists(DefaultDatabase);
        IndexCreation.CreateIndexes(typeof(Users).Assembly, store);

        using (var ses = store.OpenSession())
        {
            var user = new User { 
                Name = "Irmak", 
                Roles = new List<string> { "adMin", "GuEst" } 
            };
            ses.Store(user);
            ses.SaveChanges();

            //this finds name:Irmak
            var user1 = ses.Query<User, Users>()
                .Where(usr => usr.Name == "ırmak")
                .FirstOrDefault();
        }
    }
}

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
    public ICollection<string> Roles { get; set; }
}

public class Users : AbstractIndexCreationTask<User>
{
    public Users()
    {
        Map = users => from user in users
                       select new 
                       {
                          user.Name 
                       };

        Analyzers.Add(
            x => x.Name, 
            typeof(LuceneAnalyzers.TurkishLowerCaseKeywordAnalyzer)
                .AssemblyQualifiedName);
    }
}

It worked like a charm. I'm hoping this helps you solve this annoying problem, and please post a comment if you know a better way of handling this.

Scaling out SignalR with a Redis Backplane and Testing It with IIS Express

Learn how easy it is to scale out SignalR with a Redis backplane and simulate a local web farm scenario with IIS Express
2013-07-02 11:01
Tugberk Ugurlu


SignalR was built with scale-out in mind from day one, and it ships with some scale-out providers such as Redis, SQL Server and Windows Azure Service Bus. There is a really nice documentation series on this on the official ASP.NET SignalR web site, and you can find Redis, Windows Azure Service Bus and SQL Server samples there. In this quick post, I would like to show you how easy it is to get SignalR up and running in a scale-out scenario with a Redis backplane.

Sample Chat Application

First of all, I have a very simple and stupid real-time web application. The source code is also available on GitHub if you are interested: RedisScaleOutSample. Guess what it is? Yes, you’re right. It’s a chat application :) I’m using SignalR 2.0.0-beta2 in this sample and here is what my hub looks like:

public class ChatHub : Hub
{
    public void Send(string message)
    {
        Clients.All.messageReceived(message);
    }
}

A very simple hub implementation. Now, let’s look at the entire HTML and JavaScript code that I have:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Chat Sample</title>
</head>
<body>
    <div>
        <input type="text" id="msg" /> 
        <button type="button" id="send">Send</button>
    </div>
    <ul id="messages"></ul>

    <script src="/Scripts/jquery-1.6.4.min.js"></script>
    <script src="/Scripts/jquery.signalR-2.0.0-beta2.min.js"></script>
    <script src="/signalr/hubs"></script>
    <script>
        (function () {

            var chatHub = $.connection.chatHub,
                msgContainer = $('#messages');

            chatHub.client.messageReceived = function (msg) {
                $('<li>').text(msg).appendTo(msgContainer);
            };

            $.connection.hub.start().done(function () {

                $('#send').click(function () {
                    var msg = $('#msg').val();
                    chatHub.server.send(msg);
                });
            });
        }());
    </script>
</body>
</html>

When I run the application, I can see that it works like a charm:

[Image 1]

This is a single-machine scenario, and if we move this application to multiple VMs, a web farm or whatever your choice of scaling out is, you will see your application misbehaving. The reason is actually very simple. Let’s try to understand why.

Understanding the Need for a Backplane

Assume that you have two VMs for your super chat application: VM-1 and VM-2. Client-a comes to your application and your load balancer routes that request to VM-1. As the SignalR connection is persisted for as long as possible, client-a stays connected to VM-1 for any messages it receives (assuming it is not on the Long Polling transport) and sends (if it is on Web Sockets). Then, client-b comes to your application and the load balancer routes that request to VM-2 this time. What happens now? Any messages that client-a sends will not be received by client-b because they are on different nodes and SignalR has no idea about any node other than the one it’s executing on.

To demonstrate this scenario easily in our development environment, I will fire up the same application in different ports through IIS Express with the following script:

function programfiles-dir {
    if (is64bit) {
        (Get-Item "Env:ProgramFiles(x86)").Value
    } else {
        (Get-Item "Env:ProgramFiles").Value
    }
}

function is64bit() {
    return ([IntPtr]::Size -eq 8)
}

$executingPath = (Split-Path -parent $MyInvocation.MyCommand.Definition)
$appPPath = (join-path $executingPath "RedisScaleOutSample")
$iisExpress = "$(programfiles-dir)\IIS Express\iisexpress.exe"
$args1 = "/path:$appPPath /port:9090 /clr:v4.0"
$args2 = "/path:$appPPath /port:9091 /clr:v4.0"

start-process $iisExpress $args1 -windowstyle Normal
start-process $iisExpress $args2 -windowstyle Normal

I’m running IIS Express here from the command line and it’s a very powerful feature if you ask me. When you execute the script above (which is run.ps1 in my sample application), you will have the chat application running on localhost:9090 and localhost:9091:

[Image 2]

When we try the same scenario now by connecting to both endpoints, you will see that it’s not working as it should:

[Image 3]

SignalR makes it really easy to solve this type of problem. In its core architecture, SignalR uses a pub/sub mechanism to broadcast messages. Every message in SignalR goes through the message bus and, by default, SignalR uses Microsoft.AspNet.SignalR.Messaging.MessageBus, which implements IMessageBus, as its in-memory message bus. However, this is fully replaceable, and it’s where you need to plug in your own message bus implementation for your scale-out scenarios. The SignalR team provides a bunch of backplanes for you to work with, but you can also implement your own if none of the scale-out providers that the SignalR team ships is enough for you. For instance, the community has a RabbitMQ message bus implementation for SignalR: SignalR.RabbitMq.
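The difference between a per-node in-memory bus and a shared backplane can be shown with a toy simulation. This is only a rough sketch: SimpleBus and Node are hypothetical types invented for this example and have nothing to do with SignalR's actual IMessageBus implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a message bus; not SignalR's IMessageBus.
public class SimpleBus
{
    private readonly List<Action<string>> _subscribers = new List<Action<string>>();

    public void Subscribe(Action<string> handler) => _subscribers.Add(handler);

    public void Publish(string message)
    {
        foreach (var handler in _subscribers)
        {
            handler(message);
        }
    }
}

// A node forwards everything that arrives on its bus to its local clients.
public class Node
{
    private readonly SimpleBus _bus;

    public List<string> Delivered { get; } = new List<string>();

    public Node(SimpleBus bus)
    {
        _bus = bus;
        _bus.Subscribe(message => Delivered.Add(message));
    }

    public void Send(string message) => _bus.Publish(message);
}

public static class BackplaneDemo
{
    public static void Main()
    {
        // No backplane: each node owns its own in-memory bus.
        var vm1 = new Node(new SimpleBus());
        var vm2 = new Node(new SimpleBus());
        vm1.Send("hello from client-a");
        Console.WriteLine(vm2.Delivered.Count); // 0

        // Backplane: both nodes publish to and subscribe from one shared bus.
        var backplane = new SimpleBus();
        var vm3 = new Node(backplane);
        var vm4 = new Node(backplane);
        vm3.Send("hello from client-a");
        Console.WriteLine(vm4.Delivered.Count); // 1
    }
}
```

In a real deployment the shared bus lives outside the web processes (Redis, in this post), so every message pays a network round trip that the in-memory version avoids.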

Hooking up Redis Backplane to Your SignalR Application

In order to configure Redis as the backplane for SignalR, we need to have a Redis server up and running somewhere. The Redis project does not directly support Windows, but Microsoft Open Tech provides a Redis Windows port which targets both x86 and x64 architectures. The better news is that they distribute the binaries through NuGet: http://nuget.org/packages/Redis-64.

[Image 4]

Now that I have the Redis binaries, I can get the Redis server up. For our demonstration purposes, running redis-server.exe without any arguments, with the default configuration, should be enough:

[Image 5]

The Redis server is running on port 6379 now and we can configure SignalR to use Redis as its backplane. The first thing to do is to install the SignalR Redis Messaging Backplane NuGet package. As I’m using SignalR 2.0.0-beta2, I will install version 2.0.0-beta2 of the Microsoft.AspNet.SignalR.Redis package.

[Image 6]

The last thing to do is to write one line of code to replace the IMessageBus implementation:

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        GlobalHost.DependencyResolver
            .UseRedis("localhost", 6379, string.Empty, "myApp");

        app.MapHubs();
    }
}

The parameters we are passing into the UseRedis method are related to your Redis server. In our case, we don’t have a password and that’s why we passed string.Empty. Now, let’s compile the application and run the same PowerShell script to stand up two endpoints which simulate a web farm scenario in your development environment. When we navigate to both endpoints, we will see that messages are broadcast to all nodes no matter which node they arrive at:

[Image 7]

That was insanely easy to implement, wasn’t it?

A Few Things to Keep in Mind

The purpose of SignalR’s backplane approach is to enable you to serve more clients in cases where one server is becoming your bottleneck. As you can imagine, having a backplane for your SignalR application can affect the message throughput, as your messages need to go through the backplane first and be distributed from there to all subscribers. For high-frequency real-time applications, such as real-time games, a backplane is not recommended. For those cases, cleverer load balancers are what you would want. Damian Edwards talked about SignalR and different scale-out cases in his Build 2013 talk, and I strongly recommend you check that out if you are interested.
