Filtered by Tag (architecture)
If you have never been exposed to software system design challenges, you might be totally lost on even where to begin. Dive into this post to find out about what matters when it comes to software architecture and system design and how you can get your grip in this wide area of software engineering.
@ 02-23-2019
by Tugberk Ugurlu

If you have never been exposed to software software system design challenges, you might be totally lost on even where to begin. I believe in finding the limits to a certain extend first and then start getting your hands dirty. The way you can start this is by finding some interesting product or services (ideally you are a fan of), and learning about their implementations. You will be surprised that how simple they may look, they most probably involve great deal of complexity. Don’t forget: simple is usually complex and that’s OK™.

Photo by Isaac Smith on Unsplash

I believe the biggest suggestion I can give you while approaching to system design challenges is this: not to assume anything! You should pin down the facts and expectations from this system first. Some good questions to ask here are which will help you start this process:

  • What is the problem you are trying to solve? 
  • What is the the peak volume of users that will interact with your system?
  • What are the data write and read patterns going to be?
  • What are the expected failure cases, how do you plan to mitigate them?
  • What are the availability and consistency expectations?
  • Do you need to worry about any auditing, regulation aspects?
  • What type of sensitive data are you going to be storing?
These are just a questions few that have worked for me and the teams that I worked with over the years. Once you have answers to these questions (or any other which are relevant to the context you are in), then you should be starting to dive into the technical side of the problem.

Setting Your Baseline

What do I mean by the baseline here? Well, in this era of software development, most of the problems "can" be solved by already existing techniques and technologies. Knowing these to a certain extend will give you a head start when you are faced with similar problems. Remember, we are writing software to solve business' and our users' problems and the desire is to do that in a most straight-forward and simple way from a user experience point of view. Why do you need to remember this? It could well be your reality that you should solve problems in unique ways as you might be thinking "what's the point of me writing software then if I am here to follow a pattern?". The craft here is in the decision making process to define where to do what. Surely, we may have challenging, unique problems which we can face at certain times. However, if we have our baseline solid, we will surely know whether we should direct our efforts into finding out ways to solve the problems or further understand the depth of it.

I believe I have convinced you at this point now that having a solid knowledge on how some of the exciting systems are architecturally shaped is quite critical for you to progress on having some appreciation on the craft and a solid baseline.

OK, but where to start? Donne Martin has a GitHub repo called system-design-primer which helps you learn how to design large-scale systems and also prep for the system design interviews. Inside this, there is a section dedicated to real world architectures which also involves some system designs of well-known companies such as Twitter, Uber, etc.

However, before jumping into this, you might want to have some insights on what matters the most in the architectural challenges. This is important because there are A LOT of aspects involved in disambiguating a gnarly, ambiguous problem and solving it within the guidelines of a defined system. Jackson Gabbard, an ex-Facebook employee, has a 50 mins video on system design interviews based on his experience on interviewing hundreds of candidates at Facebook. Even if this is focused on the system design interview objective and what success looks like for that, it's still a very comprehensive resource on what matters the most when it comes to system design. There is also a write-up of this video.

Start Building up Your Data Storage and Retrieval Knowledge

Most of the time, the choice of how you decide to persist and serve data will play a crucial role on the performance of your system. Therefore, you should be able to understand the expectations around data writes and reads about your system first. Then, you should be able to assess these and convert that assessment into a choice. However, you can only do this effectively if you know the existing storage patterns. This essentially means having a good knowledge around database choices. 

Databases are really scalable and durable data structures. So, all your knowledge around data structures should be really beneficial around understanding the various database choices. For example, Redis is a data structures server, supporting different kinds of values. It allows you to work with the concept of data strictures such as sets and lists, and provides you to read data through commonly-known algorithms such as LRU in a durable and highly available fashion. 

Photo by Samuel Zeller on Unsplash

Once you get enough grip around the various data storage patterns, it's now time for you to get into data consistency and availability land. CAP theorem is the first thing you should try to have a good grip of, which you can polish it off by looking deeper into established consistency and availability patterns. These will allow you to have a wide spectrum when it comes to understanding data writes and reads are really very separate concerns and have separate challenges associated to them. By embracing several consistency and availability patterns, you can gain a lot of performance while serving the data to your applications.

Finally around data storage needs, you should also be aware of caching. Should it be both on the client and server?  What data will you cache?  And why?  How will you invalidate the cache?  (will it be based on time?  If so, how long?). This section of system-design-primer should be a good starting point on this topic. 

Communication Patterns

Systems are composed of various components, which can be different processes living inside the same physical node or different machines sitting at the separate parts in your network. Some of these resources might be private within your network but some needs to be accessed publicly by your consumers.

These resources needs to be able to communicate between them and to the outside world. In context of system design, this again introduces another set of unique challenges. Understanding how asynchronous workflows can help you and what are the various communication patterns available such as TCP, UDP, HTTP (which sits on top of TCP), etc. will help you understand the breadth of the problem space and solutions currently available. 

Photo by Tony Stoddard on Unsplash

When dealing with communication to the outside world, security is always another side-effect that you need to be aware of and actively deal with. 

Connection Distribution

I am not sure if this logical grouping makes sense here. I will go with it anyway since it’s the closest term that reflects what I want to cover here.
Systems are formed by gluing multiple components together, and how they communicate with each other often is designed through well-established protocols such as TCP and UDP. However, these protocols are often not enough on their own to cover the needs of today’s systems which can have high load and demands from our consumers. We often need ways to be able to distribute connections in order to handle the high load of our system.

Domain Name System (DNS) sits at the core of this distribution. A DNS translates a domain name such as to an IP address. Besides this, some DNS services can route traffic through various methods such as weighted round robin and latency-based to help distribute the load.

Load balancing is very vital and nearly every major system on the Web we interact with today sits behind one or multiple load balancers. Load balancers help us distribute incoming client requests to multiple instances of resources. There both hardware and software forms of load balancers but it’s often that you see software based ones used such as HAProxy and ELBReverse proxies are also very smilar to the concept of load balancing with some distinctive differences though. These differences will have an effect on your choice based the needs.

Content Delivery Networks (CDN) are also something which you should be aware of. A CDN is a globally distributed network of proxy servers, serving content from locations closer to the user. CDNs are usually preferred when you are serving static files such as JavaScript, CSS and HTML. It’s also common that you see cloud services offer traffic managers (such as Azure Traffic Manager) which gives you global distribution and reduced latency benefits for your dynamic content. However, these services are mostly beneficial if you have stateless web services.  

What About My Business Logic? Structuring Business Logic, Workflows and Components

Thus far, we talked about all the infrastructure related aspects of a system. These are the parts of your system which your users probably have no idea about and to be frank, they don't give a damn about them. What they care about is how they interact with your system, what they can achieve by doing so and how the system acts on behalf of them to make certain decisions and process their data.

As you might guess from this post’s title, I intended this blog post to be about software architecture and system design. Therefore, I wasn’t going to cover the software design patterns which are concerned with how the components are built. However, thinking about this more and more, it’s clear to me that the line between them are very blurred and usually both sides are interconnected. Take Event Sourcing for example. Once you adopt this software architecture pattern, it pretty much effects most parts of your system; how you persist data, what level consistency you choose for your system’s clients to deal with, how you shape the components within your system, so on and so forth. Therefore, I decided to touch on some of the design and architectural patterns related which directly concerns your business logic. Even if it’s going to be just touching the surface, it should be useful for you have some ideas. Here is a few of them:

Collaboration Approaches

It's highly unlikely that you are going to be the only one involved in a project where you need to be part of a system design process. Therefore, you need to be able to collaborate with other folks in your team, both inside and outside of your job function. There is also a breadth and depth of this surface area and as the technical leader, you should be able to address the concerns on each level by going into it with a required depth. The activities here may involve evaluating technology choices together or pinning down the business needs and understanding how the work needs to be parallelised.

Photo by Kaleidico on Unsplash

First and foremost, you need to have an accurate and shared understanding of what you are trying to achieve as a business goal and what moving parts involved in this aspect. Group modeling techniques such as event storming are powerful methods to accelerate this process and increases your changes of success. You may get into this process before or after you define your service boundaries, deepening on your product/service maturity stage. Based on the level of alignment you see here, you may want to facilitate a separate activity to define the Ubiquitous Language for the bounded context you are operating on. When it comes to communicating the architecture of your system, you may find the C4 model for software architecture from Simon Brown useful, especially when it comes to understanding what level of depth you should go into while visualising what you are trying to convey.

There are most probably other mature techniques available in this space. However, all will tie back to your domain understanding and your experience and knowledge around Domain-driven Design will prove to be handy. 

Some Other Resources

Here are some resources which may help you. These are not in any particular oder.

Long time ago (about 5 years, at least), I contributed an article to SignalR wiki about scaling SignalR with Redis. You can still find the article here. I also blogged about it here. However, over time, pictures got lost there. I got a few requests from my readers to refresh those images and I was luckily able to find them :) I decided to publish that article here so that I would have a much better control over the content.
@ 08-08-2018
by Tugberk Ugurlu

Long time ago (about 5 years, at least), I contributed an article to SignalR wiki about scaling a SignalR application with Redis. You can still find the article here. I also blogged about it here. However, over time, pictures got lost there. I got a few requests from my readers to refresh those images and I was lucky enough to be able to find them :) I decided to publish that article here so that I would have a much better control over the content. So, here is the post :)

Please keep in mind that this is a really old post and lots of things have evolved since then. However, I do believe the concepts still resonate and it’s valuable to show the ways of how to achieve this within a cloud provider’s context.

SignalR with Redis Running on a Windows Azure Virtual Machine

This wiki article will walk your through on how you can run your SignalR application in multiple machines with Redis as your backplane using Windows Azure Virtual Machines for scale out scenarios.

Creating the Windows Azure Virtual Machines

First of all, we will spin up our virtual machines. What we want here is to have two Windows Server 2008 R2 virtual machines for our SignalR application and we will name them as Web1-08R2 and Web2-08R2. We will have the IIS installed on both of these servers and at the end, we will load balance the request on port 80.

Our third virtual machine will be another Windows Server 2008 R2 only for our Redis server. We will call this server Redis-08R2.

To spin up the VMs, go to new Windows Azure Management Portal and hit New icon at the bottom-right corner.


Creating a virtual machine running Windows Server 2008 R2 is explained here in details. We followed the same steps to create our first VM named Web1-08R2.

The second VM we will be creating has a slightly different approach than the first one. Under the hood, every virtual machine is a cloud service instance and we want to put our second VM (Web2-08R2) under the same cloud service that our first web VM is running under. To do that, we need to follow the same steps as explained inside the previously mentioned article but when we come to 3rd step in the creation wizard, we should chose Connect to existing Virtual Machine option this time and we should choose our first VM we have just created.


As the last step, we now need to create our redis VM which will be named Redis-08R2. We will follow the same steps as we did when we were creating our second web VM (Web2-08R2).

Setting Up Redis as a Windows Service

To use Redis on a Windows machine, we went to Redis on Windows prototype GitHub page and cloned the repository and followed the steps explained under How to build Redis using Visual Studio section.

After you build the project, you will have all the files you need under msvs\bin\release path as zip files. file will contain the redis server, redis command line interface and some other stuff. file will contain the msi file to install redis as a windows service. You can just copy those zip files to your Redis VM and extract under c:\redis\bin. Then follow the steps:

  • Currently, there is a bug in the RedisWatcher installer and if you don't have Microsoft Visual C++ 2010 Redistributable Package installed on your machine, the service won't start. So, I installed it first.

  • Copy this redis.conf file and put it under c:\redis\bin directory. Open it up and add a password by adding the following line of code:

    requirepass 1234567

    Take this note into considiration when you are setting up your redis password:

    Warning: since Redis is pretty fast an outside user can try up to 150k passwords per second against a good box. This means that you should use a very strong password otherwise it will be very easy to break.

  • Then, extract the somewhere and run the InstallWatcher.msito install the service.

  • Navigate to C:\Program Files (x86)\RedisWatcher directory. You will see a file named watcher.conf inside this directory. Open this file up and replace the entire file with the following text. Only difference here is that we are supplying the redis.conf file directory for the server to use:

    exepath c:\redis\bin
    exename redis-server.exe
     workingdir c:\redis\inst1
     runmode hidden
     saveout 1
     cmdparms c:\redis\bin\redis.conf
  • Create a folder named inst1 under c:\redis because we have specified this folder as working directory for our redis instance.

  • When you do a search against windows services in PowerShell, you will see RedisWatcherSvc service is installed.


  • Run the following PowerShell command to start the service for the first time.

    (Get-Service -Name RedisWatcherSvc).Start()

Now we have a Redis server running on our VM. To test if it is actually running, open up a windows command window under c:\redis\bin and run the following command (assuming you set your password 1234567):

redis-cli -h localhost -p 6379 -a 1234567

Now, you have a redis client running.


Ping the redis to see if you are really authenticated:


Now, we are nearly set. As a last step in our redis server, we need to open up TCP port 6379 for external communication. You can do this under Windows Firewall with Advanced Security window as explained here.


Communicating Through Internal Endpoints Between Windows Azure Virtual Machines Under Same Cloud Service

When you are inside one of your web VMs, you can simply look up the redis VM by hostname.


The hostname will resolve to DIP (Dynamic IP Address) which Windows Azure will use internally. We can configure public endpoints through Windows Azure Management Portal easily but in that case, we would be opening redis to the whole world. Also, if we communicate to our redis server through VIP (Virtual IP Address), we would always go through the load balancer which has its own additional cost.

So, we can easily connect to our redis server from any other connected VM by hostname.

The SignalR Application with Redis

Our SignalR application will not be that much different from a normal SignalR application thanks to SignalR.Redis project. All you need to do is to add the SignalR.Redis nuget package into your application and configure SignalR to use Redis as the message bus inside the Application_Start method in Global.asax.cs file:

protected void Application_Start(object sender, EventArgs e)
    // Hook up redis
    string server = ConfigurationManager.AppSettings["redis.server"];
    string port = ConfigurationManager.AppSettings["redis.port"];
    string password = ConfigurationManager.AppSettings["redis.password"];

    GlobalHost.DependencyResolver.UseRedis(server, Int32.Parse(port), password, "SignalR.Redis.Sample");

For our demo, the AppSettings should look like as below:

    <add key="redis.server" value="Redis-08R2" />
    <add key="redis.port" value="6379" />
    <add key="redis.password" value="1234567" />

I put the application under IIS on our both web servers (Web1-08R2 and Web2-08R2) and configured them to run under .NET Framework 4.0 integrated application pool.

For this demo, I am using the Redis.Sample chat application included inside the SignalR.Redis project.

Let's test them quickly before going public. I fired the both web applications inside the servers and here is the result:


Perfectly running! Let's open them up to the world.

Opening up the Port 80 and Load Balancing the Requets

Our requirement here is to make our application reachable over HTTP and at the same time, we want to load balance the request between our two web servers.

To do that, we need to go to Windows Azure Management portal and set up the TCP endpoints for port 80.

First, we navigate to dashboard of our Web1-08R2 VM and hit Endpoints from the dashboard menu:


From there, hit the End Endpoint icon at the bottom of the page:


A wizard is going to appear on the screen:


Click the right-arrow icon and go to next step which is the last one and we will enter the port details there:


After that, our endpoint will be created:


Follow the same steps of Web2-08R2 VM as well and open the Add Endpoint wizard. This time, we will be able to select Load-balance traffic on an existing port. Chose the previously created port and continue:


At the last step, enter the proper details and hit save:


We will see our new endpoint is being crated but this time Load Balanced column indicates Yes.


As we configured our web applications without a host name and they are exposed through port 80, we can directly run reach our application through the URL or Public Virtual IP Address (VIP) which is provided to us. When we run our application, we should see it running as below:


No matter which server it goes, the message will be broadcasted to every client because we will be using Redis as a message bus.


I want to share a few thoughts that I have been keeping to myself on showing progress for long-running asyncronous operations on a system where individual events can be sent during ongoing operations.
@ 07-30-2016
by Tugberk Ugurlu

With today's modern applications, we end up with lots of asynchronous operations happening and we somehow have to let the user know what is happening. This, of course, depends on the requirements and business needs.

Let's look at the Amazon case. When you buy something on Amazon, Amazon tells you that your order has been taken. However, it doesn't really mean that you have actually purchased it as the process of taking the payment has not been kicked off yet. The payment process probably goes through a few stages and at the end, it has a few outcome options like success, faılure, etc. Amazon only gives the user the change to know about the outcome of this process which is done through e-mail. This is pretty straightforward to implement (famous last words?) as it doesn't require things like real-time process notifications, etc..

On the other hand, creating a VM on Microsoft Azure has different needs. When you kick the VM creation operation through an Azure client like Azure Portal, it can give you real-time notifications on the operation status to show you in what stage the operation is (e.g. provisioning, starting up, etc.).

In this post, I will look at the later option and try to make a few points to prove that the implementation is a bit tricky. However, we will see a potential solution at the end and that will hopefully be helpful to you as well.

Problematic Flow

Let's look at a problematic flow:

  • User starts an operation by sending a POST/PUT/PATCH request to an HTTP endpoint.
  • Server sends the 202 (Accepted) response and includes a some sort of operation identifier on the response body.
  • Client subscribes to a notification channel based on this operation identifier.
  • Client starts receiving events on that notification channel and it decides on how to display them on the user interface (UI).
  • There are special event types which give an indication that the operation has ended (e.g. completed, failed, abandoned, etc.).
    • The client needs to know about these event types and handle them specially.
    • Whenever the client receives an event which corresponds to operation end event, it should unsubscribe from the notification channel for that specific operation.

We have a few problems with the above flow:

  1. Between the operation start and the client subscribing to a notification channel, there could be events that potentially could happen and the client is going to miss those.
  2. If the user disconnects from the notification channel for a while (for various reasons), the client will miss the events that could happen during that disconnect period.
  3. Similar to above situation, the user might entirely shut its client and opens it up again at some point later to see the progress. In that case, we lost the events that has happened before subscribing to the notification channel on the new session.

Possible Solution

The solution to all of the above problems boils down to persisting the events and opening up an HTTP endpoint to pull all the past events per each operation. Let's assume that we are performing payment transactions and want to give transparent progress on this process. Our events endpoint would look similar to below:

GET /payments/eefcb363/events

          "id": "c12c432c",
          "type": "ProcessingStarted",
          "message": "The payment 'eefcb363' has been started for processing by worker 'h8723h7d'.",
          "happenedAt": "2016-07-30T11:05:26.222Z"
          "details": {
               "paymentId": "eefcb363",
               "workerId": "h8723h7d"

          "id": "6bbb1d50",
          "type": "FraudCheckStarted",
          "message": "The given credit card details are being checked against fraud for payment 'eefcb363' by worker 'h8723h7d'.",
          "happenedAt": "2016-07-30T11:05:28.779Z"
          "details": {
               "paymentId": "eefcb363",
               "workerId": "h8723h7d"

          "id": "f9e09a83",
          "type": "ProgressUpdated",
          "message": "40% of the payment 'eefcb363' has been processed by worker 'h8723h7d'.",
          "happenedAt": "2016-07-30T11:05:29.892Z"
          "details": {
               "paymentId": "eefcb363",
               "workerId": "h8723h7d",
               "percentage": 40

This endpoint allows the client to pull the events that have already happened, and the client can mix these up with the events that it receives through the notification channel where the events are pushed in the same format.

With this approach, suggested flow for the client is to follow the below steps in order to get 100% correctness on the progress and events:

  • User starts an operation by sending a POST/PUT/PATCH request to an HTTP endpoint.
  • Server sends the 202 (Accepted) response and includes a some sort of operation identifier (which is 'eefcb363' in our case) on response body.
  • Client subscribes to a notification channel based on that operation identifier.
  • Client starts receiving events on that channel and puts them on a buffer list.
  • Client then sends a GET request based on the operation id to get all the events which have happened so far and puts them into the buffer list (only the ones that are not already in there).
  • Client can now start displaying the events from the buffer to the user (optionally in chronological order) and keeps receiving events through the notification channel; updates the UI simultaneously.

Obviously, at any display stage, the client needs to honor that there are special event types which give an indication that the operation has ended. The client needs to know about these types and handle them specially (e.g. show the completed view, etc.). Besides this, whenever the client receives an event which corresponds to operation end event, it should unsubscribe from the notification channel for that specific operation.

However, the situation can be a bit easier if you only care about percentage progress. If that's case, you may still want to persist all events but you only care about the latest event on the client. This makes your life easier and if you decide that your want to be more transparent about showing the progress, you can as you already have all the data for this. It is just a matter of exposing them with the above approach.


Under any circumstances, I would suggest you to persist operation events (you can even go further with event sourcing and make this a natural process). However, your use case may not require extensive and transparent progress reporting through the client. If that's the case, it will certainly make your implementation a lot simpler. However, you can change your mind later and go with a different progress reporting approach since you have been already persisting all the events.

Finally, please share your thoughts on this if you see a better way of handling this.

Hi 🙋🏻‍♂️ I'm Tugberk Ugurlu.
Coder 👨🏻‍💻, Speaker 🗣, Author 📚, Microsoft MVP 🕸, Blogger 💻, Software Engineering at Deliveroo 🍕🍜🌯, F1 fan 🏎🚀, Loves travelling 🛫🛬
Lives in Cambridge, UK 🏡