A Gentle Introduction to Azure Search

The Microsoft Azure team released Azure Search, a hosted search service, as a preview product a few days ago. Azure Search is a suitable product if you are dealing with a high volume of data (millions of records) and want efficient, complex and clever search over that data. In this post, I will try to lay out some fundamentals of this service with a very high-level introduction.
10 September 2014
8 minutes read

With many of the applications we build as software developers, we need to expose our data and keep it within easy reach so that users can find what they are looking for. This task is especially tricky if you have a large amount of data (millions or even billions of records) in your system. At that point, the application needs to give the user a great, flawless experience so that they can filter the results down to what they are actually looking for. Don't we have solutions that address this problem? Of course we do, and solutions such as Elasticsearch and Apache Solr are top-notch problem solvers in this area. However, hosting these products in your own environment and making them scalable is another job entirely.

To address these problems, the Microsoft Azure team released Azure Search, a hosted search service, as a preview product a few days ago. Azure Search is a suitable product if you are dealing with a high volume of data (millions of records) and want efficient, complex and clever search over that data. If you have worked with a search engine product (such as Elasticsearch or Apache Solr) before, you will be very comfortable with Azure Search as it has many similar features. In fact, Azure Search is built on top of Elasticsearch to provide its full-text search function. However, you shouldn't see this brand-new product as a hosted Elasticsearch service on Azure because it has a completely different public interface.

In this post, I will try to lay out some fundamentals of this service with a very high-level introduction. I’m hoping that it’s also going to be a starting point for my Azure Search blog posts :)

Getting to Know Azure Search Service

When I look at the Azure Search service, I see four pieces which together give us the whole experience:

  • Search Service
  • Search Unit
  • Index
  • Document

The search service is the highest level of the hierarchy and it contains the provisioned search unit(s). A few concepts, such as authentication and scaling, also apply at the search service level.

Search units allow for scaling of QPS (Queries per second), Document Count and Document Size. This also means that search units are the key concept for high availability and throughput. As a side note, high availability requires at least 3 replicas for the preview.

An index is the holder for a collection of documents based on a defined schema, which specifies the capabilities of the index (we will touch on this schema later). A search service can contain multiple indexes.

Lastly, a document is the actual holder for the data, shaped by the schema of the index it lives in. A document has a key, and this key needs to be unique within the index. A document also has fields to represent the data. Each field carries attributes that define its capabilities, such as whether it can be used to filter the results. Also note that the number of documents an index can contain is limited by the search units the service has.
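To make the index and document relationship a bit more concrete, here is a minimal in-memory sketch in Python. It is only a mental model and says nothing about how the service is implemented. The class names (`Field`, `Index`), the `put` method and the default attribute values are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """One field of the index schema; attributes describe what the field can do."""
    name: str
    type: str
    key: bool = False          # illustrative default, not taken from the service
    filterable: bool = True    # illustrative default, not taken from the service


@dataclass
class Index:
    """A named collection of documents that all follow the same schema."""
    name: str
    fields: list
    documents: dict = field(default_factory=dict)

    def key_field(self) -> str:
        # The model expects exactly one field to be marked as the key.
        keys = [f.name for f in self.fields if f.key]
        if len(keys) != 1:
            raise ValueError("an index needs exactly one key field")
        return keys[0]

    def put(self, document: dict) -> None:
        # Reject fields that the schema does not know about.
        unknown = set(document) - {f.name for f in self.fields}
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        # Documents are stored by key, so a key identifies at most one document.
        self.documents[document[self.key_field()]] = document
```

In this model, putting two documents with the same key leaves a single document in the index, which is one simple way to picture "the key needs to be unique within the index".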

Windows Azure Portal Experience

Let's first have a look at the portal experience and how we can get a search service ready for our use. Azure Search is not available through the current Microsoft Azure portal. It's only available through the preview portal. Inside the new portal, click the big plus sign at the bottom left and then click "Everything".

This is going to get you to "Gallery". From there click "Data, storage, cache + backup" and then click "Search" from the new section.

You will have a nice intro about the Microsoft Azure Search service within the new window. Hit "Create" there.

Keep in mind that the service name must contain only lowercase letters, digits or dashes, cannot have a dash in its first two characters or as its last character, cannot contain consecutive dashes, and must be between 2 and 15 characters long. Other naming conventions for the service are laid out in the Naming Conventions section of the documentation.
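If you want to check a name before you type it into the portal, the rules above can be sketched in a few lines of Python. The function name `is_valid_service_name` is made up for this post, and the check only covers the rules listed here, so treat it as a rough pre-flight check rather than the portal's actual validation.

```python
import re


def is_valid_service_name(name: str) -> bool:
    """Check a candidate service name against the naming rules listed above."""
    # Between 2 and 15 characters long.
    if not 2 <= len(name) <= 15:
        return False
    # Only lowercase letters, digits or dashes.
    if not re.fullmatch(r"[a-z0-9-]+", name):
        return False
    # No dash in the first two characters or as the last character.
    if "-" in name[:2] or name.endswith("-"):
        return False
    # No consecutive dashes.
    return "--" not in name
```

For example, `is_valid_service_name("my-search")` passes, while `"a-search"` fails because the dash is in the second position.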

When you come to selecting the Pricing Tier, it's time to make a decision about your usage scenario.

Now, there are two options: Standard and Free. The free one should be considered a sandbox experience because it's too limited in terms of both performance and storage space. You shouldn't judge the Azure Search service by the free tier. It is, however, great for evaluating the HTTP API. You can create a free service and run your HTTP requests against it.

The standard tier is the one to choose for production use. It can be scaled both in terms of QPS (Queries per Second) and document size through shards and replicas. Head to the "Configure Search in the Azure Preview portal" article for more in-depth information about scaling.

When you are done setting up your service, you can now get the admin key or the query key from the portal and start hitting the Azure Search HTTP (or REST, if you want to call it that) API.

Azure Search HTTP API

The Azure Search service is managed through its HTTP API, and it's not hard to guess that even the Azure portal uses this API to manage the service. It's a lightweight API which understands JSON as the content type. When we look at it, we can divide this HTTP API into three parts:

  • Index Management
  • Index Population
  • Query and Lookup

The Index Management part of the API allows us to manage indexes with operations such as creating, deleting and listing them. It also allows us to see some index statistics. Creating the index is probably going to be the first operation you will perform and it has the following structure:

POST https://{search-service-name}.search.windows.net/indexes?api-version=2014-07-31-Preview HTTP/1.1
User-Agent: Fiddler
api-key: {your-api-key}
Content-Type: application/json
Host: {search-service-name}.search.windows.net

{
	"name": "employees",
	"fields": [{
		"name": "employeeId",
		"type": "Edm.String",
		"key": true,
		"searchable": false
	},
	{
		"name": "firstName",
		"type": "Edm.String"
	},
	{
		"name": "lastName",
		"type": "Edm.String"
	},
	{
		"name": "age",
		"type": "Edm.Int32"
	},
	{
		"name": "about",
		"type": "Edm.String",
		"filterable": false,
		"facetable": false
	},
	{
		"name": "interests",
		"type": "Collection(Edm.String)"
	}]
}

With the above request, you can also spot a few more things which apply to every API call we make. There is a header we are sending with the request: api-key. This is where you are supposed to put your API key. Also, we are passing the API version through a query string parameter called api-version. Have a look at the Azure Search REST API MSDN documentation for more detailed information.
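If you would rather build this request from code than from Fiddler, here is a minimal sketch using only the Python standard library. It just constructs the request object and does not send anything. The helper name `build_create_index_request` and the sample service name and key in the usage note are placeholders I made up; the URL shape, the api-key header and the api-version value are the ones from the request above.

```python
import json
import urllib.request

API_VERSION = "2014-07-31-Preview"


def build_create_index_request(service_name: str, api_key: str, index: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an index."""
    url = f"https://{service_name}.search.windows.net/indexes?api-version={API_VERSION}"
    return urllib.request.Request(
        url,
        data=json.dumps(index).encode("utf-8"),  # the index schema as a JSON body
        # urllib normalizes header casing internally; HTTP header names are case-insensitive.
        headers={"api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Calling `build_create_index_request("my-search", "{your-api-key}", {"name": "employees", "fields": [...]})` gives you a request that you could hand to `urllib.request.urlopen` once you are ready to hit a real service.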

With this request, we are specifying the schema of the index. Keep in mind that schema updates are limited at the time of this writing. Although existing fields cannot be changed or deleted, new fields can be added at any time. When a new field is added, all existing documents in the index will automatically have a null value for that field. No additional storage space will be consumed until new documents are added to the index. Have a look at the Update Index API documentation for further information on index schema updates.
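The update rule is easy to express as a small client-side check. The sketch below only models the rule as described in this post (existing fields must stay exactly as they are, new ones may be appended); it is not the service's own validation, and the function names are made up.

```python
def added_fields(current_fields: list, updated_fields: list) -> list:
    """Return the names of newly added fields, or raise if an existing field changed."""
    updated_by_name = {f["name"]: f for f in updated_fields}
    for existing in current_fields:
        # An existing field must still be present with an identical definition.
        if updated_by_name.get(existing["name"]) != existing:
            raise ValueError(f"existing field cannot be changed or deleted: {existing['name']}")
    current_names = {f["name"] for f in current_fields}
    return [f["name"] for f in updated_fields if f["name"] not in current_names]


def read_field(document: dict, field_name: str):
    """Documents stored before a field existed simply report None (null) for it."""
    return document.get(field_name)
```

Running the check before you send a schema update gives you a readable error on your side if you accidentally touch an existing field.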

After you have your index schema defined, you can start populating your index with the Index Population API. The Index Population API is a little bit different and I honestly don’t like it (I have a feeling that Darrel Miller won’t like it either :)). The reason I don’t like it is the way we define the operation. With this HTTP API, we can add a new document, or update or remove an existing one. However, we define the type of the operation inside the request body, which is so weird if you ask me. The other weird thing about this API is that you can send multiple operations in one HTTP request by putting them inside a JSON array. The important fact here is that those operations don’t run in a transaction, which means that some of them may succeed and some of them may fail. So, how do we know which one actually failed? The response will contain a JSON array indicating each operation’s status. Nothing wrong with that, but why reinvent the wheel? :) I would be happier sending a batch request using the multipart content type. Anyway, enough bitching about the API :) Here is a sample request to add a new document to the index:

POST https://{search-service-name}.search.windows.net/indexes/employees/docs/index?api-version=2014-07-31-Preview HTTP/1.1
User-Agent: Fiddler
api-key: {your-api-key}
Content-Type: application/json
Host: {search-service-name}.search.windows.net

{
	"value": [{
		"@search.action": "upload",
		"employeeId": "1",
		"firstName": "Jane",
		"lastName": "Smith",
		"age": 32,
		"about": "I like to collect rock albums",
		"interests": ["music"]
	}]
}

As mentioned, you can send the operations in a batch:

POST https://{search-service-name}.search.windows.net/indexes/employees/docs/index?api-version=2014-07-31-Preview HTTP/1.1
User-Agent: Fiddler
api-key: {your-api-key}
Content-Type: application/json
Host: {search-service-name}.search.windows.net

{
	"value": [{
		"@search.action": "upload",
		"employeeId": "2",
		"firstName": "Douglas",
		"lastName": "Fir",
		"age": 35,
		"about": "I like to build cabinets",
		"interests": ["forestry"]
	},
	{
		"@search.action": "upload",
		"employeeId": "3",
		"firstName": "John",
		"lastName": "Fir",
		"age": 25,
		"about": "I love to go rock climbing",
		"interests": ["sports", "music"]
	}]
}

Check out the great documentation on the Index Population API to learn more about it.
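Since the operations in a batch don't run in a transaction, client code usually wants two things: a way to build the batch body and a way to pick out the operations that failed. Here is a rough Python sketch of both. The body shape ("value" array with "@search.action") comes from the requests above. The response shape used in `failed_keys` (a "value" array whose items carry "key" and "status") is an assumption on my part, so check it against the documentation before relying on it. Both function names are made up.

```python
import json


def build_batch(actions: list) -> str:
    """Build the request body from (action, document) pairs, e.g. ("upload", {...})."""
    value = [{"@search.action": action, **document} for action, document in actions]
    return json.dumps({"value": value})


def failed_keys(response_body: str) -> list:
    """Return the keys of the operations that did not succeed."""
    # ASSUMPTION: each response item looks like {"key": "...", "status": true or false}.
    items = json.loads(response_body)["value"]
    return [item["key"] for item in items if not item["status"]]
```

With something like this in place, a caller can retry or log only the documents whose keys come back from `failed_keys` instead of resending the whole batch.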

Lastly, there are query and lookup APIs where you can use OData 4.0 expression syntax to define your query. Go and check out its documentation as well.
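To give a feel for what a query looks like, here is a small Python sketch that builds a search URL against the employees index from earlier. Be aware that the docs path and the parameter names (`search`, `$filter`) are my reading of the REST API rather than something shown in this post, so verify them against the documentation. The helper name and the sample filter expression are made up as well.

```python
from urllib.parse import urlencode

API_VERSION = "2014-07-31-Preview"


def build_search_url(service_name: str, index_name: str, search_text: str, odata_filter: str = "") -> str:
    """Build a GET URL for a full-text search with an optional OData filter."""
    params = {"search": search_text, "api-version": API_VERSION}
    if odata_filter:
        # ASSUMPTION: the filter goes into a $filter parameter using OData expression syntax.
        params["$filter"] = odata_filter
    base = f"https://{service_name}.search.windows.net/indexes/{index_name}/docs"
    # urlencode takes care of escaping the "$" and the spaces in the filter.
    return f"{base}?{urlencode(params)}"
```

For example, `build_search_url("my-search", "employees", "rock", "age gt 30")` would look for employees mentioning "rock" who are older than 30, assuming the age field can be used in filters. The request still needs the api-key header when you actually send it.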

Even though the service is so new, there are already great things happening around it. Sandrino Di Mattia has two cool open source projects for Azure Search: the RedDog.Search .NET client and the RedDog Search Portal, a web-based UI tool for managing your Azure Search service. Another one comes from Richard Astbury: an Azure Search node.js / JavaScript client. I strongly encourage you to check them out. There are also two great video presentations about Azure Search by Liam Cavanagh, a Senior Program Manager in the Azure Data Platform Incubation team at Microsoft.

Stop what you are doing and go watch them if you care about Azure Search. They will give you a nice overview of the product and could be your starting point.

You can also view my talk at AzureConf 2014 about Azure Search: