Scalability – Udi Dahan – The Software Simplist

Logically Distributed, Physically Centralized

udidahan — Sun, 06 May 2012 22:56:45 +0000

When people pull back the covers on something like MSMQ, particularly its private queues (the way NServiceBus uses it), and they see that MSMQ is storing its messages in C:\Windows\System32, well, they’re not particularly happy.

One of the reasons they worry about these types of distributed or federated queue-based solutions has to do with physical failures. The concern is that messages would be lost if there was a hard drive failure.

The preference for centralized message-broker solutions comes from being able to set them up on a nice RAID infrastructure that will take care of any physical reliability concerns. (Just so that we’re clear here – I’m talking about a single datacenter, possibly connected to a disaster recovery site.)

So, here’s the thing:

Virtualization

You see, in a virtualized production environment, the C drive of a virtual machine is physically in the image file of that VM, which is sitting on a SAN (storage area network).

What that means is that when a message is sent from one processing node to another, the data of that message ends up being written to the SAN, with all of its redundant disks. Even if one of the machines has a critical failure and cannot start up again, all the VMs that were running on it can be started up on a different machine without any message loss.

In fact, most virtualized environments have monitoring and management capabilities built-in so that the VMs will be brought up automatically on another machine. Even if you aren’t using messaging, there are so many other benefits that virtualization brings that you probably should be planning on putting it in, if you haven’t already.

Databases too

In fact, many people do the same thing with databases.
The file partitions on which the database server actually stores its data are on the SAN.

Think about that for a second.

All the data in messages flowing through your queues, and the data in the database, on a SAN. This gives you the ability to do a fully consistent backup of the entire system with SAN snapshots, not to mention ship those to your disaster recovery site.

In closing

Distributed solutions are often misunderstood.
Bad experiences in the past with MSMQ can color perceptions in the present.

The thing is that today’s infrastructure is set up to handle distributed solutions much better.
Developers no longer have to turn to centralized broker or database technologies to get the centralized backup and restore capabilities administrators look for.

If you’ve been avoiding NServiceBus for these reasons, give it a try. Not only will it make your life as a developer easier; combined with this virtualization thing, it will make your administrators’ lives easier too.

NServiceBus in Insurance – Testimonial

udidahan — Fri, 30 Mar 2012 10:26:24 +0000

Getting more companies willing to go on the record about their usage of NServiceBus.

Update: Check out full list of customers.

Here’s a testimonial from a large insurance company whose name always had me wondering…

If P & C Insurance is the leading property and casualty insurance company in the Nordic countries with 3.6 million customers and 6400 employees. NServiceBus was chosen as the ESB for one of our core insurance systems and is currently processing around 100,000 insurance policies a month. We expect that number to double within a year.

The support NServiceBus provides around asynchronous messaging and publish/subscribe communication makes the implementation of Service Oriented Architectures very straightforward. While the performance and scalability of the platform is able to handle the massive loads we’re under, we’re just as pleased with how quickly developers are able to get started with NServiceBus – even those coming into the project later on benefit from the maintainability of NServiceBus-centric code. We already have over 30 developers using it.

Looking forwards, we will be using NServiceBus on more systems at ever larger scales.

Common CQRS Abuses

udidahan — Sun, 26 Feb 2012 16:23:34 +0000

Abuse #1

“I’m using CQRS because I need to scale.”

While CQRS may be more scalable than other more traditional architectures, the use of asynchronous communication often complicates the user interaction model causing users to not see the changes they made to data in the UI until later. Trying to compensate for this (by writing even more code) digs one deeper into the complexity hole.

When I point to non-collaborative subdomains and state “You don’t need CQRS for that”, the reason is that in these areas you don’t tend to have much read/write contention. While multiple users/actors may be working in parallel, they don’t touch the same set of data (or do so only very rarely).

In these environments, all you need is a scalable data storage technology – something designed to scale out (unlike most relational databases). This can take the form of NoSQL databases like HBase and Cassandra. Often all you need is for the UI to query that directly and show the results, and the same goes for persisting the data back – possibly with some basic validation and calculation code on the side.

No commands, events, DTOs, publish/subscribe, domain model, etc.

As Ayende says – JFHCI, just f-ing hard code it.

You’d be surprised how much of your data this approach can apply to.

With the time you save on all the less important stuff, you’ll have more time to apply CQRS the right way for the high-value/high-complexity parts of your system.

***

Just a final note, as registration for my course in New York is coming to a close in 2 weeks, I wanted to let you all know that the price for the course will be going up this April, after the course in Sydney. The reason for this is that the courses I run myself (at the current rate) have been cannibalizing attendees from the partner companies I do the course with.

I’ll be providing significant discounts to independent consultants (and others paying their own way) to try to keep things fair. Hope to see you there.

Go to the registration page.

The Myth Of “Infinite Scalability”

udidahan — Thu, 29 Dec 2011 09:58:52 +0000

Scalability is a topic near and dear to my heart.

Many a client seeks me out for the first time for help in this area.

Usually the request is for an amount substantially smaller than infinity.

It’s usually on the discussion groups and in conference presentations that infinity is brought into it.

The basics

The first issue with scalability is the use of the word as an adjective: scalable.

“Is the system scalable?”

Or the similar verb use: “Does it scale?”

The problem here is the implication that there is a yes/no answer to the question.

Scalability is not boolean.

Linear Scalability

When people talk about scalability, or a system being able to scale, they’re usually referring to a graph that looks something like this:

The red graph indicating a system that does not scale well, the green graph indicating one that does.

What is missing from this diagram are the labels of the axes.

The Y axis is Cost, Expense, or Money.
The X axis is usually the number of users (for internet-type companies).

Ultimately, scalability is a cost-function that will tell us how much it will cost to have the system support a certain number of users.

Linear scalability is when the cost of the next user is the same as the cost of the previous user. This means our system doesn’t have bottlenecks. This is what people usually mean when they say “infinite scalability”.
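To make the "cost of the next user" idea concrete, here is a rough sketch comparing a linear cost curve with one that has a bottleneck. The function names, the per-user cost, and the contention factor are all made up for illustration; they are assumptions, not measurements of any real system.

```python
def linear_cost(users: int, cost_per_user: float = 2.0) -> float:
    """Total cost when every additional user costs the same."""
    return cost_per_user * users


def bottlenecked_cost(users: int, cost_per_user: float = 2.0,
                      contention: float = 0.001) -> float:
    """Total cost when contention makes each additional user more expensive."""
    return cost_per_user * users + contention * users ** 2


def marginal_cost(cost, users: int) -> float:
    """Cost of adding the next user at a given system size."""
    return cost(users + 1) - cost(users)


for n in (1_000, 100_000):
    # Linear: the next user costs the same at any size.
    # Bottlenecked: the next user gets pricier as the system grows.
    print(n, marginal_cost(linear_cost, n), marginal_cost(bottlenecked_cost, n))
```

In the linear case the marginal cost stays flat no matter how large the system already is; in the bottlenecked case it keeps climbing, which is what the "does not scale well" curve is really saying.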

But there’s more

As many of the internet companies (and their investors) have realized over the years, there’s a difference between the number of users and the number of active users. It’s very easy to scale to a billion users when only 1000 of them are active at any given time.

To be more accurate, what we want is additional X-axes for things like total data managed by the system, number of requests per user, resource utilization per request, propagation speed (how quickly information entered by one user needs to be visible to others), and more.

Scalability is a multi-dimensional cost function, where part of an architect’s job is to figure out which dimensions are significant for the system/business, and what the expectation for growth is across each axis.

Preparing for “infinity”

Be careful not to optimize for only a single dimension – reality is a whole lot more complex.

There are so many other things to deal with as a system scales.

For example, do you really think you’re going to want your configuration entirely centralized? Putting everything in one place means easier management, yes, but it also means a mistake will instantly affect everyone. Is it worth the risk? Maybe instead of centralization, we could do with some automation that will allow a staged rollout of configuration changes with the ability to rollback.

The same goes for rolling out new versions, patches, and upgrades.

But that now means we may have multiple versions of the same system in production at the same time. How will that work? Will they all talk to the same database? How will we version the database then? If not, how will we handle state? Won’t this mean our code will have to be backwards compatible from one version to another? Isn’t that hard? Like, insanely hard?

Please, can we park the whole “infinite scalability” thing?
It’s really not the most important concern – not by a long shot.

When to avoid CQRS

udidahan — Fri, 22 Apr 2011 20:32:50 +0000

It looks like CQRS has finally “made it” as a full-blown “best practice”.

Please accept my apologies for my part in the overly-complex software being created because of it.

I’ve tried to do what I could to provide a balanced view on the topic with posts like Clarified CQRS and Race Conditions Don’t Exist.

It looks like that wasn’t enough, so I’ll go right out and say it:

Most people using CQRS (and Event Sourcing too) shouldn’t have done so.

Should we really go back to N-Tier?

When not using CQRS (which is the majority of the time), you don’t need N-Tier either.

You see, if you’re not in a collaborative domain then you don’t have multiple writers to the same logical set of data as an inherent property of your domain. As such, having a single database where all data lives isn’t really necessary.

Data is inherently partitioned by who owns it.

Let’s take the online shopping cart as an example. There aren’t any use cases where users operate on each others’ carts – ergo, not collaborative, therefore not a good candidate for CQRS. Same goes for user profiles, and tons of other cases.

So why is it that we need a separate tier to run our business logic?

Originally, the application server tier was introduced for improved scalability, but specifically around managing the connection pool to the database. Increasing numbers of clients (when each had its own user/account for connecting to the database) caused problems. Luckily, most web applications side-step this problem – that is, until someone got the idea that the web server was only supposed to run the UI layer, and the Business Logic layer would be on a separate application server tier.

Rubbish – see Fowler’s First Law of Distribution: Don’t.

Keep it all on one tier. Same goes for smart clients.
No, Silverlight, you don’t count – architecturally speaking, you’re a glorified browser.

But what about scalability?

In a non-collaborative domain, where you can horizontally add more database servers to support more users/requests/data at the same time you’re adding web servers – there is no real scalability problem (caveat, until you’re Amazon/Google/Facebook scale).

Database servers can be cheap – if using MySQL/SQL Express/others.
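One possible way to picture "data partitioned by who owns it" is a routing function that sends each owner to one database server. This is only a sketch: the server names are hypothetical, and hashing the owner ID is just one assumption about how the partitioning could be done.

```python
import hashlib

DATABASE_SERVERS = ["db-0", "db-1", "db-2"]  # hypothetical server names


def server_for_owner(owner_id: str, servers=DATABASE_SERVERS) -> str:
    """Pick the database server that holds all data owned by this user."""
    # A stable hash keeps the same owner on the same server across processes.
    digest = hashlib.sha256(owner_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


print(server_for_owner("alice"), server_for_owner("bob"))
```

Note that plain modulo hashing moves most owners when the server list changes, so adding servers in practice would probably call for a lookup table or consistent hashing instead.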

But what about the built-in event-log CQRS/ES gives us?

Architectural gold-plating / stealing from the business.

Who put you in a position to decide that development time and resources should be diverted from short-term business-value-adding features to support a non-functional requirement that the business didn’t ask for?

If you sat down with them, explaining the long-term value of having an archive of all actions in the system, and they said OK, build this into the system from the beginning, that would be fine. Most people who ask me about CQRS and/or Event Sourcing skip this step.

Finally, you can usually implement this specific requirement with some simple interception and logging. Don’t over-engineer the solution. If using messaging, you can get this by turning on journaling, or if you want to centralize this archive, NServiceBus can forward all messages to a specific queue.
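As a minimal sketch of "simple interception and logging", the decorator below records every message before the handler sees it. The names `audited`, `AUDIT_LOG` and `change_address` are invented for this example, and the in-memory list merely stands in for a journal queue or an audit table; this is not how NServiceBus implements its forwarding.

```python
import functools
import json
import time

AUDIT_LOG = []  # stand-in for a journal queue or audit table


def audited(handler):
    """Record every message a handler receives before processing it."""
    @functools.wraps(handler)
    def wrapper(message: dict):
        AUDIT_LOG.append({
            "handler": handler.__name__,
            "received_at": time.time(),
            "message": json.dumps(message, sort_keys=True),
        })
        return handler(message)
    return wrapper


@audited
def change_address(message: dict) -> str:
    # Hypothetical handler; the business logic itself knows nothing about auditing.
    return f"address of customer {message['customer_id']} changed"


change_address({"customer_id": 42, "street_id": 7})
```

The handler code stays untouched, which is the point: the archive is a cross-cutting concern bolted on from the outside rather than a reason to restructure the system.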

Don’t forget that this storage has a cost – including administration. Nothing is free.

What about the “proof of correctness” in Event Sourcing

I’ve heard statements made that when you use the events that flowed into/through your system AS your system’s data, rather than transforming those events to some other schema (relational or otherwise) and storing the result – you can prove that your system behaves correctly.

Let me put it this way:

No programming technique used by humans will prevent those same humans from creating bugs.
No testing technique used by humans will prevent those same humans from not catching those bugs.
* Automated tests – see programming technique.

While having a full archive of all events can allow us to roll the system back to some state, fix a bug, and roll forwards, that assumes that we’re in a closed system. We have users which are outside the system. If a user made a decision based on data influenced by the bug, there’s no automated way for us to know that, or correct for it as we roll forwards.

In short, we’re interested in the business’ behavior – as composed of user and system behavior. No proof can exist.

Umm, so where should we use it

If you’ve uncovered a scenario where you’re wondering “first-one-wins, or last-one-wins”, that’s often a good candidate for a place where CQRS could make sense. Then re-read my Race Conditions Don’t Exist post.

Also, CQRS should not be your top-level architectural pattern – that would be SOA.
CQRS, if used at all, would be used inside a service boundary only.

Given that SOA guides us away from having a given 3rd normal form entity exist in any one service, it is unlikely that the building blocks of your CQRS design will be those kinds of entities. Most 3rd normal form one-to-many and many-to-many relationships simply do not exist when doing SOA and CQRS properly.

Therefore, I’m sorry to say that most sample applications you’ll see online that show CQRS are architecturally wrong. I’d also be extremely wary of frameworks that guide you towards an entity-style aggregate root CQRS model.

In Summary

So, when should you avoid CQRS?

The answer is most of the time.

Here’s the strongest indication I can give you to know that you’re doing CQRS correctly: Your aggregate roots are sagas.

And the biggest caveat – the above are generalizations, and can’t necessarily be true for every specific scenario. If you’re Greg Young, then you probably can (and will) decide on your own on these matters. For everybody else, please take these warnings to heart. There have been far too many clients that have come to me all mixed up with their use of CQRS in areas where it wasn’t warranted.

If you want to know everything you need to know to apply CQRS appropriately, please come to my course – there is so much unlearning to do first that just can’t happen via a series of blog posts.

CQRS Video Online

udidahan — Fri, 26 Feb 2010 09:42:45 +0000

A couple of weeks ago I gave a talk on Command/Query Responsibility Segregation in London.

The recording of the talk is online here.

There is one important thing that I didn’t have enough time to cover, but I want you to keep in mind as you’re watching this. It is that CQRS is applicable only *within* the context of a single service/BC – NOT across or between them.

Let me know what you think.

Scalability Podcast on Herding Code

udidahan — Mon, 11 Jan 2010 19:44:49 +0000

The great folks over at Herding Code were nice enough to interview me back in November as I was over in Paris giving my 5-day SOA course. We talked about quite a lot of topics related to scalability.

Click here for the full list of topics and to download the podcast.

Let me know what you think or any questions you may have in the comments.

Clarified CQRS

udidahan — Wed, 09 Dec 2009 14:57:19 +0000

After listening to how the community has interpreted Command-Query Responsibility Segregation, I think that the time has come for some clarification. Some have been tying it to Event Sourcing. Most have been overlaying their previous layered architecture assumptions on it. Here I hope to identify CQRS itself, and describe in which places it can connect to other patterns.

Download as PDF – this is quite a long post.

Why CQRS

Before describing the details of CQRS we need to understand the two main driving forces behind it: collaboration and staleness.

Collaboration refers to circumstances under which multiple actors will be using/modifying the same set of data – whether or not the intention of the actors is actually to collaborate with each other. There are often rules which indicate which user can perform which kind of modification and modifications that may have been acceptable in one case may not be acceptable in others. We’ll give some examples shortly. Actors can be human like normal users, or automated like software.

Staleness refers to the fact that in a collaborative environment, once data has been shown to a user, that same data may have been changed by another actor – it is stale. Almost any system which makes use of a cache is serving stale data – often for performance reasons. What this means is that we cannot entirely trust our users’ decisions, as they could have been made based on out-of-date information.

Standard layered architectures don’t explicitly deal with either of these issues. While putting everything in the same database may be one step in the direction of handling collaboration, staleness is usually exacerbated in those architectures by the use of caches as a performance-improving afterthought.

A picture for reference

I’ve given some talks about CQRS using this diagram to explain it:

The boxes named AC are Autonomous Components. We’ll describe what makes them autonomous when discussing commands. But before we go into the complicated parts, let’s start with queries:

Queries

If the data we’re going to be showing users is stale anyway, is it really necessary to go to the master database and get it from there? Why transform those 3rd normal form structures to domain objects if we just want data – not any rule-preserving behaviors? Why transform those domain objects to DTOs to transfer them across a wire, and who said that wire has to be exactly there? Why transform those DTOs to view model objects?

In short, it looks like we’re doing a heck of a lot of unnecessary work based on the assumption that reusing code that has already been written will be easier than just solving the problem at hand. Let’s try a different approach:

How about we create an additional data store whose data can be a bit out of sync with the master database – I mean, the data we’re showing the user is stale anyway, so why not reflect that in the data store itself? We’ll come up with an approach later to keep this data store more or less in sync.

Now, what would be the correct structure for this data store? How about just like the view model? One table for each view. Then our client could simply SELECT * FROM MyViewTable (or possibly pass in an ID in a where clause), and bind the result to the screen. That would be just as simple as can be. You could wrap that up with a thin facade if you feel the need, or with stored procedures, or using AutoMapper which can simply map from a data reader to your view model class. The thing is that the view model structures are already wire-friendly, so you don’t need to transform them to anything else.
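Here is roughly what "one table for each view" could look like, using Python's built-in sqlite3 module as a stand-in for whatever database holds the query store. The table name `CustomerSummaryView`, its columns, and the `load_customer_summary` function are assumptions made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be read by column name

# One denormalized table per screen; the columns match the view model exactly.
conn.execute(
    "CREATE TABLE CustomerSummaryView ("
    "customer_id INTEGER PRIMARY KEY, display_name TEXT, "
    "status TEXT, open_orders INTEGER)"
)
conn.execute(
    "INSERT INTO CustomerSummaryView VALUES (1, 'Ms Jane Doe', 'regular', 3)"
)


def load_customer_summary(customer_id: int):
    """Read one row and hand it to the UI as-is, with no intermediate mapping."""
    row = conn.execute(
        "SELECT * FROM CustomerSummaryView WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    return dict(row) if row is not None else None


print(load_customer_summary(1))
```

There are no joins, no domain objects, and no DTOs between the table and the screen; the row already has the shape the view needs.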

You could even consider taking that data store and putting it in your web tier. It’s just as secure as an in-memory cache in your web tier. Give your web servers SELECT only permissions on those tables and you should be fine.

Query Data Storage

While you can use a regular database as your query data store it isn’t the only option. Consider that the query schema is in essence identical to your view model. You don’t have any relationships between your various view model classes, so you shouldn’t need any relationships between the tables in the query data store.

So do you actually need a relational database?

The answer is no, but for all practical purposes and due to organizational inertia, it is probably your best choice (for now).

Scaling Queries

Since your queries are now being performed off of a data store separate from your master database, and there is no assumption that the data that’s being served is 100% up to date, you can easily add more instances of these stores without worrying that they don’t contain the exact same data. The same mechanism that updates one instance can be used for many instances, as we’ll see later.

This gives you cheap horizontal scaling for your queries. Also, since you’re not doing nearly as much transformation, the latency per query goes down as well. Simple code is fast code.

Data modifications

Since our users are making decisions based on stale data, we need to be more discerning about which things we let through. Here’s a scenario explaining why:

Let’s say we have a customer service representative who is on the phone with a customer. This user is looking at the customer’s details on the screen and wants to make them a ‘preferred’ customer, as well as modifying their address, changing their title from Ms to Mrs, changing their last name, and indicating that they’re now married. What the user doesn’t know is that after opening the screen, an event arrived from the billing department indicating that this same customer doesn’t pay their bills – they’re delinquent. At this point, our user submits their changes.

Should we accept their changes?

Well, we should accept some of them, but not the change to ‘preferred’, since the customer is delinquent. But writing those kinds of checks is a pain – we need to do a diff on the data, infer what the changes mean, which ones are related to each other (name change, title change) and which are separate, identify which data to check against – not just compared to the data the user retrieved, but compared to the current state in the database, and then reject or accept.

Unfortunately for our users, we tend to reject the whole thing if any part of it is off. At that point, our users have to refresh their screen to get the up-to-date data, and retype in all the previous changes, hoping that this time we won’t yell at them because of an optimistic concurrency conflict.

As we get larger entities with more fields on them, we also get more actors working with those same entities, and the higher the likelihood that something will touch some attribute of them at any given time, increasing the number of concurrency conflicts.

If only there was some way for our users to provide us with the right level of granularity and intent when modifying data. That’s what commands are all about.

Commands

A core element of CQRS is rethinking the design of the user interface to enable us to capture our users’ intent such that making a customer preferred is a different unit of work for the user than indicating that the customer has moved or that they’ve gotten married. Using an Excel-like UI for data changes doesn’t capture intent, as we saw above.
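As an illustration, the customer service scenario from earlier might be broken into commands like the ones below. The class and field names are hypothetical, chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MakeCustomerPreferred:
    customer_id: int


@dataclass(frozen=True)
class ChangeCustomerAddress:
    customer_id: int
    street_id: int
    house_number: str


@dataclass(frozen=True)
class RecordCustomerMarriage:
    customer_id: int
    new_title: str
    new_last_name: str


# What the rep did in one sitting becomes three separate units of work,
# each of which can be accepted or rejected on its own.
submitted = [
    MakeCustomerPreferred(customer_id=42),
    ChangeCustomerAddress(customer_id=42, street_id=7, house_number="12B"),
    RecordCustomerMarriage(customer_id=42, new_title="Mrs", new_last_name="Smith"),
]
```

With this shape, rejecting the ‘preferred’ change no longer forces the address and name changes to be thrown away with it, and the server never has to diff a whole record to guess what the user meant.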

We could even consider allowing our users to submit a new command even before they’ve received confirmation on the previous one. We could have a little widget on the side showing the user their pending commands, checking them off asynchronously as we receive confirmation from the server, or marking them with an X if they fail. The user could then double-click that failed task to find information about what happened.

Note that the client sends commands to the server – it doesn’t publish them. Publishing is reserved for events which state a fact – that something has happened, and that the publisher has no concern about what receivers of that event do with it.
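The difference between sending and publishing can be sketched with a toy in-memory bus. The `Bus` class and the queue names are invented for illustration and do not mirror the NServiceBus API.

```python
from collections import defaultdict, deque


class Bus:
    """Toy in-memory bus: commands have one destination, events have many."""

    def __init__(self):
        self.queues = defaultdict(deque)      # queue name -> pending messages
        self.subscribers = defaultdict(list)  # event type -> subscriber queues

    def send(self, destination: str, command: dict) -> None:
        """A command goes to exactly one queue, chosen by the sender."""
        self.queues[destination].append(command)

    def subscribe(self, event_type: str, queue_name: str) -> None:
        self.subscribers[event_type].append(queue_name)

    def publish(self, event_type: str, event: dict) -> None:
        """An event is copied to every subscriber; the publisher doesn't care who."""
        for queue_name in self.subscribers[event_type]:
            self.queues[queue_name].append(event)


bus = Bus()
bus.subscribe("CustomerMoved", "shipping")
bus.subscribe("CustomerMoved", "billing")
bus.send("customer-care", {"type": "ChangeCustomerAddress", "customer_id": 42})
bus.publish("CustomerMoved", {"customer_id": 42})
```

The sender of a command knows where it is going; the publisher of an event only knows what happened.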

Commands and Validation

In thinking through what could make a command fail, one topic that comes up is validation. Validation is different from business rules in that it states a context-independent fact about a command. Either a command is valid, or it isn’t. Business rules on the other hand are context dependent.

In the example we saw before, the data our customer service rep submitted was valid; it was only the billing event arriving earlier that required the command to be rejected. Had that billing event not arrived, the data would have been accepted.

Even though a command may be valid, there still may be reasons to reject it.

As such, validation can be performed on the client, checking that all fields required for that command are there, number and date ranges are OK, that kind of thing. The server would still validate all commands that arrive, not trusting clients to do the validation.
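A small sketch of the distinction, with made-up names throughout: `validate` depends only on the command itself and so can run unchanged on both client and server, while `can_make_preferred` needs state that only the server has.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RecordCustomerMove:
    customer_id: int
    street_id: int
    moved_on: date


def validate(command: RecordCustomerMove, today: date) -> list:
    """Context-independent checks: the same answer on client and server."""
    errors = []
    if command.customer_id <= 0:
        errors.append("customer_id is required")
    if command.street_id <= 0:
        errors.append("street_id is required")
    if command.moved_on > today:
        errors.append("moved_on cannot be in the future")
    return errors


def can_make_preferred(customer_state: dict) -> bool:
    """Business rule: depends on current state that only the server knows."""
    return not customer_state["delinquent"]


ok = validate(RecordCustomerMove(42, 7, date(2009, 12, 1)), today=date(2009, 12, 9))
bad = validate(RecordCustomerMove(0, 7, date(2010, 1, 1)), today=date(2009, 12, 9))
print(ok, bad)
```

A perfectly valid command can still be turned down by a rule like `can_make_preferred`, which is why the two checks are kept apart.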

Rethinking UIs and commands in light of validation

The client can make use of the query data store when validating commands. For example, before submitting a command that the customer has moved, we can check that the street name exists in the query data store.

At that point, we may rethink the UI and have an auto-completing text box for the street name, thus ensuring that the street name we’ll pass in the command will be valid. But why not take things a step further? Why not pass in the street ID instead of its name? Have the command represent the street not as a string, but as an ID (int, guid, whatever).

On the server side, the only reason that such a command would fail would be due to concurrency – that someone had deleted that street and that that hadn’t been reflected in the query store yet; a fairly exceptional set of circumstances.

Reasons valid commands fail and what to do about it

So we’ve got a well-behaved client that is sending valid commands, yet the server still decides to reject them. Often the circumstances for the rejection are related to other actors changing state relevant to the processing of that command.

In the CRM example above, it is only because the billing event arrived first. But “first” could be a millisecond before our command. What if our user pressed the button a millisecond earlier? Should that actually change the business outcome? Shouldn’t we expect our system to behave the same when observed from the outside?

So, if the billing event arrived second, shouldn’t that revert preferred customers to regular ones? Not only that, but shouldn’t the customer be notified of this, like by sending them an email? In which case, why not have this be the behavior for the case where the billing event arrives first? And if we’ve already got a notification model set up, do we really need to return an error to the customer service rep? I mean, it’s not like they can do anything about it other than notifying the customer.

So, if we’re not returning errors to the client (who is already sending us valid commands), maybe all we need to do on the client when sending a command is to tell the user “thank you, you will receive confirmation via email shortly”. We don’t even need the UI widget showing pending commands.
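One way to sketch this "same outcome regardless of arrival order" idea is below. The `Customer` class and its method names are assumptions for illustration, and appending to a list stands in for sending the customer an email.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    preferred: bool = False
    delinquent: bool = False
    notifications: list = field(default_factory=list)

    def _reconcile(self) -> None:
        # A delinquent customer is never preferred, whichever message came first.
        if self.delinquent and self.preferred:
            self.preferred = False
            self.notifications.append("preferred status withdrawn")

    def make_preferred(self) -> None:
        self.preferred = True
        self._reconcile()

    def mark_delinquent(self) -> None:
        self.delinquent = True
        self._reconcile()


# The command arrives a millisecond before the billing event...
command_first = Customer()
command_first.make_preferred()
command_first.mark_delinquent()

# ...or a millisecond after it.
billing_first = Customer()
billing_first.mark_delinquent()
billing_first.make_preferred()
```

Both customers end up in the same state with the same notification, and no error ever needs to travel back to the customer service rep.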

Commands and Autonomy

What we see is that in this model, commands don’t need to be processed immediately – they can be queued. How fast they get processed is a question of Service-Level Agreement (SLA) and not architecturally significant. This is one of the things that makes that node that processes commands autonomous from a runtime perspective – we don’t require an always-on connection to the client.

Also, we shouldn’t need to access the query store to process commands – any state that is needed should be managed by the autonomous component – that’s part of the meaning of autonomy.

Another part is the issue of failed message processing due to the database being down or hitting a deadlock. There is no reason that such errors should be returned to the client – we can just roll back and try again. When an administrator brings the database back up, all the messages waiting in the queue will then be processed successfully and our users receive confirmation.
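A rough sketch of that retry behavior follows. `DatabaseDown`, `drain` and `flaky_handler` are hypothetical names, and a real transport would handle the rollback and redelivery itself rather than relying on a loop like this.

```python
from collections import deque


class DatabaseDown(Exception):
    """Stand-in for a transient failure such as an outage or a deadlock."""


def drain(queue: deque, handler, max_attempts: int = 5) -> list:
    """Process queued commands in order, retrying transient failures."""
    confirmations = []
    while queue:
        command = queue[0]  # peek; only remove once handling succeeds
        for _attempt in range(max_attempts):
            try:
                confirmations.append(handler(command))
                queue.popleft()
                break
            except DatabaseDown:
                continue  # "roll back" and try the same message again
        else:
            break  # still failing: leave the message queued for later
    return confirmations


state = {"failures_left": 2}  # the "database" is down for the first two attempts


def flaky_handler(command: str) -> str:
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise DatabaseDown()
    return f"confirmed {command}"


pending = deque(["make-preferred", "change-address"])
results = drain(pending, flaky_handler)
```

The client never sees the two failures; it only ever receives the eventual confirmations.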

The system as a whole is quite a bit more robust to any error conditions.

Also, since we don’t have queries going through this database any more, the database itself is able to keep more rows/pages in memory which serve commands, improving performance. When both commands and queries were being served off of the same tables, the database server was always juggling rows between the two.

Autonomous Components

While in the picture above we see all commands going to the same AC, we could logically have each command processed by a different AC, each with its own queue. That would give us visibility into which queue was the longest, letting us see very easily which part of the system was the bottleneck. While this is interesting for developers, it is critical for system administrators.

Since commands wait in queues, we can now add more processing nodes behind those queues (using the distributor with NServiceBus) so that we’re only scaling the part of the system that’s slow. No need to waste servers on any other requests.
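To illustrate the visibility that a queue per command type gives, here is a tiny sketch. The queue names and helper functions are invented, and in-memory deques stand in for real queues.

```python
from collections import deque

# One queue per command type, each owned by its own autonomous component.
queues = {
    "MakeCustomerPreferred": deque(),
    "ChangeCustomerAddress": deque(),
}


def enqueue(command_type: str, payload: dict) -> None:
    """Route each command to the queue of the component that handles it."""
    queues[command_type].append(payload)


def longest_queue() -> str:
    """The queue with the most waiting messages is the likely bottleneck."""
    return max(queues, key=lambda name: len(queues[name]))


for customer_id in range(5):
    enqueue("ChangeCustomerAddress", {"customer_id": customer_id})
enqueue("MakeCustomerPreferred", {"customer_id": 1})

print(longest_queue())
```

Whichever queue keeps growing is the one that deserves more processing nodes; the others can be left alone.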

Service Layers

Our command processing objects in the various autonomous components actually make up our service layer. The reason you don’t see this layer explicitly represented in CQRS is that it isn’t really there, at least not as an identifiable logical collection of related objects – here’s why:

In the layered architecture (AKA 3-Tier) approach, there is no statement about dependencies between objects within a layer, or rather it is implied to be allowed. However, when taking a command-oriented view on the service layer, what we see are objects handling different types of commands. Each command is independent of the other, so why should we allow the objects which handle them to depend on each other?

Dependencies are things which should be avoided, unless there is good reason for them.

Keeping the command handling objects independent of each other will allow us to more easily version our system, one command at a time, not needing even to bring down the entire system, given that the new version is backwards compatible with the previous one.

Therefore, keep each command handler in its own VS project, or possibly even in its own solution, thus guiding developers away from introducing dependencies in the name of reuse (it’s a fallacy). If you do decide, as a deployment concern, that you want to put them all in the same process feeding off of the same queue, you can ILMerge those assemblies and host them together, but understand that you will be undoing many of the benefits of your autonomous components.

Whither the domain model?

Although in the diagram above you can see the domain model beside the command-processing autonomous components, it’s actually an implementation detail. There is nothing that states that all commands must be processed by the same domain model. Arguably, you could have some commands be processed by transaction script, others using table module (AKA active record), as well as those using the domain model. Event-sourcing is another possible implementation.

Another thing to understand about the domain model is that it now isn’t used to serve queries. So the question is, why do you need to have so many relationships between entities in your domain model?

(You may want to take a second to let that sink in.)

Do we really need a collection of orders on the customer entity? In what command would we need to navigate that collection? In fact, what kind of command would need any one-to-many relationship? And if that’s the case for one-to-many, many-to-many would definitely be out as well. I mean, most commands only contain one or two IDs in them anyway.

Any aggregate operations that may have been calculated by looping over child entities could be pre-calculated and stored as properties on the parent entity. Following this process across all the entities in our domain would result in isolated entities needing nothing more than a couple of properties for the IDs of their related entities – “children” holding the parent ID, like in databases.

In this form, commands could be entirely processed by a single entity – voilà, an aggregate root that is a consistency boundary.
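To make the shape of such entities a bit more concrete, here is a minimal C# sketch. The Customer and Order classes, their members, and the RecordOrderPlaced method are hypothetical names chosen for illustration; how and when the pre-calculated total gets updated (by a command, or in response to an event) is left open here.

```csharp
using System;

// Hypothetical entities for illustration only; names are not from the original post.
public class Customer
{
    public Guid Id { get; private set; }

    // Pre-calculated aggregate: maintained as changes are processed,
    // instead of being computed by looping over a collection of orders.
    public decimal TotalOrderValue { get; private set; }

    public Customer(Guid id) { Id = id; }

    public void RecordOrderPlaced(decimal orderValue)
    {
        if (orderValue < 0) throw new ArgumentOutOfRangeException(nameof(orderValue));
        TotalOrderValue += orderValue;
    }
}

public class Order
{
    public Guid Id { get; private set; }

    // The "child" holds only the parent's ID, like a foreign key in a database.
    public Guid CustomerId { get; private set; }
    public decimal Value { get; private set; }

    public Order(Guid id, Guid customerId, decimal value)
    {
        Id = id;
        CustomerId = customerId;
        Value = value;
    }
}
```

Note that there is no Orders collection on Customer at all; anything that needs "all orders for a customer" is a query and would be served from the query store.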

Persistence for command processing

Given that the database used for command processing is not used for querying, and that most (if not all) commands contain the IDs of the rows they’re going to affect, do we really need to have a column for every single domain object property? What if we just serialized the domain entity and put it into a single column, and had another column containing the ID? This sounds quite similar to key-value storage that is available in the various cloud providers. In which case, would you really need an object-relational mapper to persist to this kind of storage?

You could also pull out an additional property per piece of data where you’d want the “database” to enforce uniqueness.
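As a rough sketch of what that could look like, the following C# uses an in-memory dictionary as a stand-in for a two-column table (ID plus serialized entity), with one extra lookup standing in for a column with a uniqueness constraint. CustomerState, SerializedEntityStore, and the choice of Email as the unique value are all assumptions made for the example; System.Text.Json is used only because it ships with current .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical entity; public setters keep System.Text.Json round-tripping simple.
public class CustomerState
{
    public Guid Id { get; set; }
    public string Email { get; set; }
    public bool IsPreferred { get; set; }
}

// In-memory stand-in for a two-column table (ID, serialized entity)
// plus one extra "column" on which uniqueness is enforced.
// Simplifications: Email is assumed non-null, and an old email is not
// released when a customer changes it.
public class SerializedEntityStore
{
    private readonly Dictionary<Guid, string> rows = new Dictionary<Guid, string>();
    private readonly Dictionary<string, Guid> emailOwners = new Dictionary<string, Guid>();

    public void Save(CustomerState customer)
    {
        Guid owner;
        if (emailOwners.TryGetValue(customer.Email, out owner) && owner != customer.Id)
            throw new InvalidOperationException("Email already in use.");

        emailOwners[customer.Email] = customer.Id;
        rows[customer.Id] = JsonSerializer.Serialize(customer);
    }

    // Returns null when no row exists for the given ID.
    public CustomerState Load(Guid id)
    {
        string json;
        return rows.TryGetValue(id, out json)
            ? JsonSerializer.Deserialize<CustomerState>(json)
            : null;
    }
}
```

Loading by ID and saving the whole entity back is all the command side needs here, which is why an object-relational mapper starts to look optional.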

I’m not suggesting that you do this in all cases – rather just trying to get you to rethink some basic assumptions.

Let me reiterate

How you process the commands is an implementation detail of CQRS.

Keeping the query store in sync

After the command-processing autonomous component has decided to accept a command, modifying its persistent store as needed, it publishes an event notifying the world about it. This event often is the “past tense” of the command submitted:

MakeCustomerPreferredCommand -> CustomerHasBeenMadePreferredEvent

The publishing of the event is done transactionally together with the processing of the command and the changes to its database. That way, any kind of failure on commit will result in the event not being sent. This is something that should be handled by default by your message bus, and if you’re using MSMQ as your underlying transport, requires the use of transactional queues.

The autonomous component which processes those events and updates the query data store is fairly simple, translating from the event structure to the persistent view model structure. I suggest having an event handler per view model class (AKA per table).
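A handler of this kind might look roughly like the sketch below. The event class mirrors the example event above; PreferredCustomerRow, PreferredCustomerViewHandler, and the dictionary standing in for the query store table are hypothetical, and the wiring to an actual message bus is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event, modeled on the example event above.
public class CustomerHasBeenMadePreferredEvent
{
    public Guid CustomerId { get; set; }
    public DateTime OccurredAtUtc { get; set; }
}

// Hypothetical row of a persistent view model (one class per table).
public class PreferredCustomerRow
{
    public Guid CustomerId { get; set; }
    public DateTime PreferredSinceUtc { get; set; }
}

// One handler per view model class: it only translates event -> row.
public class PreferredCustomerViewHandler
{
    // Stand-in for the query data store table.
    private readonly Dictionary<Guid, PreferredCustomerRow> table;

    public PreferredCustomerViewHandler(Dictionary<Guid, PreferredCustomerRow> table)
    {
        this.table = table;
    }

    public void Handle(CustomerHasBeenMadePreferredEvent message)
    {
        // Upsert keyed by ID, so a redelivered event leaves the same row behind.
        table[message.CustomerId] = new PreferredCustomerRow
        {
            CustomerId = message.CustomerId,
            PreferredSinceUtc = message.OccurredAtUtc
        };
    }
}
```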

Here’s the picture of all the pieces again:

Bounded Contexts

While CQRS touches on many pieces of software architecture, it is still not at the top of the food chain. CQRS, if used, is employed within a bounded context (DDD) or a business component (SOA) – a cohesive piece of the problem domain. The events published by one BC are subscribed to by other BCs, each updating their query and command data stores as needed.

UIs from the CQRS found in each BC can be “mashed up” in a single application, providing users a single composite view on all parts of the problem domain. Composite UI frameworks are very useful for these cases.

Summary

CQRS is about coming up with an appropriate architecture for multi-user collaborative applications. It explicitly takes into account factors like data staleness and volatility and exploits those characteristics for creating simpler and more scalable constructs.

One cannot truly enjoy the benefits of CQRS without considering the user-interface, making it capture user intent explicitly. When taking into account client-side validation, command structures may be somewhat adjusted. Thinking through the order in which commands and events are processed can lead to notification patterns which make returning errors unnecessary.

While the result of applying CQRS to a given project is a more maintainable and performant code base, this simplicity and scalability require understanding the detailed business requirements and are not the result of any technical “best practice”. If anything, we can see a plethora of approaches to apparently similar problems being used together – data readers and domain models, one-way messaging and synchronous calls.

Although this blog post is over 3000 words (a record for this blog), I know that it doesn’t go into enough depth on the topic (it takes about 3 days out of the 5 of my Advanced Distributed Systems Design course to cover everything in enough depth). Still, I hope it has given you the understanding of why CQRS is the way it is and possibly opened your eyes to other ways of looking at the design of distributed systems.

Questions and comments are most welcome.

MySpace Architecture Considered Expensive

udidahan — Fri, 09 Oct 2009 21:24:09 +0000

I just finished listening to the Microsoft presentation on how they use the Concurrency & Coordination Runtime (CCR) in MySpace (the stated largest web site running .NET).

Some interesting numbers were stated in the talk.

Tens of thousands to hundreds of thousands of requests per second
Over 3 thousand web servers
Over a thousand mid-tier servers

No wonder most big web sites don’t run .NET. The Windows licenses would put them out of business.

Well, that is if you follow those same architectural practices.

I’ve written in the past of alternative architectural approaches that can scale to those levels at easily an order of magnitude less hardware (I think it’s closer to two OOMs) – here’s one of them on the topic of weather:

Building Super-Scalable Web Systems with REST.

By the way, the client quoted in that post is now well above 60 million users with only small incremental increases in hardware. Oh, and they’re running everything on Windows and .NET. The question is not “can it scale”, but rather “how much will it cost to scale”.

Architecture pays itself back faster than ever in the Web 2.0 world.

MSDN Magazine Smart Client Article

udidahan — Sat, 28 Mar 2009 19:16:39 +0000

My article on “optimizing a large-scale Software+Services application” has been published in the April edition of MSDN Magazine.

Here’s a short excerpt:

“We had to juggle occasional connectivity, data synchronization, and publish/subscribe all at the same time. We learned that we couldn’t solve all problems either client-side or server-side, but rather that an integrated approach was needed since any changes on one side needed corresponding changes on the other side.”

Messaging ROI

udidahan — Sun, 22 Feb 2009 10:12:59 +0000

There’s been some recent discussion as to the “cost” of messaging:

Greg Young asserts:

“I believe that this shows there to be a rather negligible cost associated with the use of such a model. There is however a small cost, this cost however I believe only exists when one looks at the system in isolation.”

Ayende adds his perspective:

“The cost of messaging, and a very real one, comes when you need to understand the system. In a system where message exchange is the form of communication, it can be significantly harder to understand what is going on.”

Of course, both these intelligent fellows are right. The reason for the apparent disparity in viewpoints has to do with which part of the following graph you look at. Ayende zooms in on the left side:

As systems get larger, though, the only way to understand them is by working at higher levels of abstraction. That’s where messaging really shines, as the incremental complexity remains the same by maintaining the same modularity as before:

In Ayende’s post, he follows the design I described a while back on using messaging for user management and login for a high-scale web scenario. In his comments, he agrees with the above stating:

“I certainly think that a similar solution using RPC would be much more complex and likely more brittle.”

I feel quite conservative in saying that most enterprise solutions fall on the right side of the intersection in the graph.

That being said, don’t underestimate the learning curve developers go through with messaging. While the mechanics are similar, the mindset is very different. Think about it like this:

You’ve driven a car for years in the US. It’s practically second nature. Then you fly to the UK, rent a car, and all of a sudden, your brain is in meltdown. (or vice versa for those going from the UK to the US)

Summary

If you are going down the messaging route, please be aware that there are shades of gray there as well. You don’t have to implement your user management and login the way I outlined in my post if you don’t require such high levels of scalability, but even lower levels of scalability can benefit from messaging.

Just as there isn’t a single correct design for non-messaging solutions, the same is true for those using messaging. Finding the right balance is tricky, and critical.

When the code is simple in every part of the system, and the asynchronous interactions are what provide for the necessary complexity the problem domain requires, that’s when you know you’ve got it just right.

Building Super-Scalable Web Systems with REST

udidahan — Mon, 29 Dec 2008 21:38:58 +0000

I’ve been consulting with a client who has a wildly successful web-based system, with well over 10 million users and looking at a tenfold growth in the near future. One of the recent features in their system was to show users their local weather and it almost maxed out their capacity. That raised certain warning flags as to the ability of their current architecture to scale to the levels that the business was taking them.

On Web 2.0 Mashups

One would think that sites like Weather.com and friends would be the first choice for implementing such a feature. Only thing is that they were strongly against being mashed-up Web 2.0 style on the client – they had enough scalability problems of their own. Interestingly enough (or not), these partners were quite happy to publish their weather data to us and let us handle the whole scalability issue.

Implementation 1.0

The current implementation was fairly straightforward – the client issues a regular web service request to the GetWeather webmethod, the server uses the user’s IP address to find out their location, then uses that location to find the weather in the database, and returns that to the user. Standard fare for most dynamic data and the way most everybody would tell you to do it.

Only thing is that it scales like a dog.

Add Some Caching

The first thing you do when you have scalability problems and the database is the bottleneck is to cache, well, that’s what everybody says (same everybody as above).

The thing is that holding all the weather of the entire globe in memory, well, takes a lot of memory. More than is reasonable. In which case, there’s a fairly decent chance that a given request can’t be served from the cache, resulting in a query to the database, an update to the cache, which bumps out something else, in short, not a very good hit rate.

Not much bang for the buck.

If you have a single datacenter, having a caching tier that stores this data is possible, but costly. If you want a highly available, business continuity supportable, multi-datacenter infrastructure, the costs add up quite a bit quicker – to the point of not being cost effective (“You need HOW much money for weather?! We’ve got dozens more features like that in the pipe!”)

What we can do is to tell the client we’re responding to that they can cache the result, but that isn’t close to being enough for us to scale.

Look at the Data, Leverage the Internet

When you find yourself in this sort of situation, there’s really only one thing to do:

In order to save on bandwidth, the most precious commodity of the internet, the various ISPs and backbone providers cache aggressively. In fact, HTTP is designed exactly for that.

If user A asks for some html page, the various intermediaries between his browser and the server hosting that page will cache that page (based on HTTP headers). When user B asks for that same page, and their request goes through one of the intermediaries that user A’s request went through, that intermediary will serve back its cached copy of the page rather than calling the hosting server.

Also, users located in the same geographic region by and large go through the same intermediaries when calling a remote site.

Leverage the Internet

The internet is the biggest, most scalable data serving infrastructure that mankind was lucky enough to have happen to it. However, in order to leverage it – you need to understand your data and how your users use it, and finally align yourself with the way the internet works.

Let’s say we have 1,000 users in London. All of them are going to have the same weather. If all these users come to our site in the period of a few hours and ask for the weather, they all are going to get the exact same data. The thing is that the response semantics of the GetWeather webmethod must prevent intermediaries from caching so that users in Dublin and Glasgow don’t get London weather (although at times I bet they’d like to).

REST Helps You Leverage the Internet

Rather than thinking of getting the weather as an operation/webmethod, we can represent the various locations’ weather data as explicit web resources, each with its own URI. Thus, the weather in London would be http://weather.myclient.com/UK/London.

If we were able to make our clients in London perform an HTTP GET on http://weather.myclient.com/UK/London then we could return headers in the HTTP response telling the intermediaries that they can cache the response for an hour, or however long we want.

That way, after the first user in London gets the weather from our servers, all the other 999 users will be getting the same data served to them from one of the intermediaries. Instead of getting hammered by millions of requests a day, the internet would shoulder easily 90% of that load making it much easier to scale. Thanks Al.
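The two pieces involved here, a URI per location and a response header that permits shared caching, can be sketched as follows. WeatherResource and its methods are hypothetical helpers, the host name is taken from the example URI above, and how the headers get attached to the response depends on your web stack. The Cache-Control value itself (public, max-age in seconds) is standard HTTP.

```csharp
using System;
using System.Collections.Generic;

public static class WeatherResource
{
    // Host taken from the example URI in the text.
    private const string BaseUri = "http://weather.myclient.com/";

    // Maps a location to its own resource URI, e.g. UK/London.
    public static string UriFor(string country, string city)
    {
        return BaseUri + Uri.EscapeDataString(country) + "/" + Uri.EscapeDataString(city);
    }

    // Response headers that allow shared caches (proxies, ISPs)
    // to keep and re-serve the response for the given time.
    public static Dictionary<string, string> CacheHeaders(TimeSpan maxAge)
    {
        return new Dictionary<string, string>
        {
            { "Cache-Control", "public, max-age=" + (long)maxAge.TotalSeconds }
        };
    }
}
```

For example, WeatherResource.CacheHeaders(TimeSpan.FromHours(1)) yields a Cache-Control value of "public, max-age=3600", which corresponds to the "cache it for an hour" case described above.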

This isn’t a “cheap trick”. While being straightforward for something like weather, understanding the nature of your data and intelligently mapping that to a URI space is critical to building a scalable system, and reaping the benefits of REST.

What’s left?

The only thing that’s left is to get the client to know which URI to call. A simple matter, really.

When the user logs in, we perform the IP to location lookup and then write a cookie to the client with their location (UK/London). That cookie then stays with the user saving us from having to perform that IP to location lookup all the time. On subsequent logins, if the cookie is already there, we don’t do the lookup.

BTW, we also show the user “you’re in London, aren’t you?” with the link allowing the user to change their location, which we then update the cookie with and change the URI we get the weather from.
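The cookie logic is simple enough to sketch in a few lines. LocationResolver, the cookie name, and the lookupByIp delegate are assumptions for the example; the dictionary stands in for whatever cookie collection your web framework exposes.

```csharp
using System;
using System.Collections.Generic;

public static class LocationResolver
{
    // Hypothetical cookie name.
    public const string CookieName = "location";

    // Returns the location from the cookie if present; otherwise performs the
    // (expensive) IP lookup once and stores the result in the cookie.
    public static string Resolve(
        IDictionary<string, string> cookies,
        string ipAddress,
        Func<string, string> lookupByIp)
    {
        string location;
        if (cookies.TryGetValue(CookieName, out location) && !string.IsNullOrEmpty(location))
            return location;

        location = lookupByIp(ipAddress);
        cookies[CookieName] = location;
        return location;
    }
}
```

Letting the user change their location then amounts to overwriting the same cookie value, after which the client simply requests a different weather URI.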

In Closing

While web services are great for getting a system up and running quickly and interoperably, scalability often suffers. Not so much as to be in your face, but after you’ve gone quite a ways and invested a fair amount of development in it, you find it standing between you and the scalability you seek.

Moving to REST is not about turning on the “make it restful” switch in your technology stack (ASP.NET MVC and WCF, I’m talking to you). Just like with databases there is no “make it go fast” switch – you really do need to understand your data, the various users’ access patterns, and the volatility of the data so that you can map it to the “right” resources and URIs.

If you do walk the RESTful path, you’ll find that the scalability that was once so distant is now within your grasp.

SOA, REST, and Pub/Sub

udidahan — Mon, 15 Dec 2008 08:34:24 +0000

From Integrated Simplicity:

The question of how web-based (or 3rd party) consumers can work with pub/sub based services comes up a lot.

Many developers are used to implementing web services exposing methods on them like GetAllCustomers.

When moving to pub/sub and other more loosely coupled messaging patterns, developers look to implement the same pattern, opting for something like duplex GetCustomersRequest and GetCustomersResponse. The reasoning is simple and straightforward – it is difficult to push data over the web to consumers.

However, there are still ways to disconnect the preparation of the data from its usage thus gaining many of the advantages of pub/sub.

By employing REST principles and modelling our customer list as an explicit resource, web-based consumers would simply perform regular HTTP GET operations on the URI to get the list of customers.

The resource itself could be a simple XML file – it wouldn’t need to be dynamic at all.

You can get all the scalability benefits of pub/sub for web based consumers. All you need is a bit of REST.

Reliability, Availability, and Scalability

udidahan — Sat, 15 Nov 2008 21:20:20 +0000

The great people at IASA have made the recording for my webcast available online:

The slides can be found here.

I also gave this talk at TechEd Barcelona and wanted to thank the attendee who posted this comment:

“You’ve done it again. Everytime I attend a session of yours I leave the room with new insights and inspiration on how to improve my software…”

You made my day.

An Answer of Scale

udidahan — Wed, 13 Aug 2008 11:22:27 +0000

To the question of scale Ayende brings up, I thought I’d tap my concept map.

First of all, I wanted to address the relationship between various topics related to scalability:

And on the connection between scalability and throughput:

The important message here is that the scalability of a system is a cost function that gives throughput as a function of recurring costs and one time costs – servers and other hardware, and the join of buy & build:

Did you write your own locking/transaction mechanism on top of an open source distributed cache or did you buy a license for a space-based technology?

Also, don’t forget that people need to administer all the servers that you have. Those people cost money (easily 100K per year). Maybe, because you haven’t invested in management or monitoring tools you need one person for every two servers. This will influence the breakdown of up front costs and recurring costs. Also, the level of availability you require will impact this as well.

In my experience, architects don’t consider often enough the operations environment in their "scalability calculations".

What this means is that there’s no such thing as technically "not being able to scale".

Rather, it means that the cost (up front + recurring) of supporting higher throughput grows faster than the function of revenue per user/request/whatever.

Sometimes, the solution is just to find ways to make more money per customer.

For more technical solutions, take a look at the difference between capacity and scalability and how the competing consumer pattern helps scale out.

Scalability, it’s all about the money.

—

Oh, I almost forgot, I also had a great conversation with Carl and Richard about scaling web sites that’s now up on the .NET Rocks site. Enjoy.

Scaling Long Running Web Services

udidahan — Wed, 30 Jul 2008 12:06:38 +0000

While I was at TechEd USA I had an attendee, Will, come up and ask me an interesting question about how to handle web service calls that can take a long time to complete. He has a number of these kinds of requests ranging from computationally intensive tasks to those requiring sifting through large amounts of data. What Will was having problems with was preventing too many of these resource-intensive tasks from running concurrently (causing increased memory usage, paging, and eventually the server becoming unavailable).

For comparison later, here’s a diagram showing the trivial interaction:

One solution that he’d tried was to set up the web server to throttle those requests and keep a much smaller maximum thread-pool size for that application pool. The unfortunate side effect of that solution was that clients would get “turned away” by a not-so-pleasant Connection Refused exception.

Will had been to my web scalability talk and was curious about how I was using queues behind my web services. I’ve also heard this question from people just getting started with nServiceBus when looking at the Web Services Bridge sample. Here’s the code that’s in the sample and in just a second I’ll tell you why you shouldn’t do this:

[WebMethod]
public ErrorCodes Process(Command request)
{
    object result = ErrorCodes.None;

    // Send the message and register a callback for the correlated response.
    IAsyncResult sync = Global.Bus.Send(request).Register(
        delegate(IAsyncResult asyncResult)
        {
            CompletionResult completionResult = asyncResult.AsyncState as CompletionResult;
            if (completionResult != null)
                result = (ErrorCodes) completionResult.ErrorCode;
        },
        null
    );

    // Block the web method (and the HTTP connection) until the response arrives.
    sync.AsyncWaitHandle.WaitOne();

    return (ErrorCodes)result;
}

Let me repeat, this is demo-ware. Do not use this in production.

What’s happening is that in this web service call we’re putting a message in a queue for some other process/machine to process. When that processing is complete, we’ll get a message back in our local queue (which you don’t see) which is correlated to our original request, firing off the callback. We block the web method from completing (using the WaitOne call) thus keeping the HTTP connection to the client open.

The problem here is that we’re wasting resources (the HTTP connection and the thread) while waiting for a response which, as already mentioned, can take a long time. In B2B or other server to server integration environments there are all sorts of middleware solutions that help us solve these problems, however in Will’s case browsers needed to interact with this web service. All he had was HTTP.

HTTP Solutions

Another attendee who was listening in (sorry I don’t remember your name) said that he was solving similar problems using polling but that he was having scalability problems as well.

What often surprises my clients when we deal with these same issues is that I do suggest a polling based solution, but one that still uses messaging, and this is what I described to Will:

Since we can’t actually push a message to a browser over HTTP from our server when processing is complete, the browser itself will be responsible for pulling the response. We still don’t want to leave costly resources like HTTP connections open a long time, however if the browser is going to be polling for a response, we’ll need some way to correlate those following requests with the original one. What we’re going to do is use the Asynchronous Completion Token pattern, and later I’ll show how to optimize it for web server technology.

Basic Polling

When the browser calls the web service, the web service will generate a Guid, put it in the message that it sends for processing, and return that guid to the browser. When the processing of the message is complete, the result will be written to some kind of database, indexed by that guid. The browser will periodically call another web method, passing in the guid it previously received as a parameter. That web method will check the database for a response using the guid, returning null if no response is there. If the browser receives a null response, it will “sleep” a bit and then retry.
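The server side of this flow can be sketched as below. PollingResponseStore and its members are hypothetical; the dictionary stands in for the database table indexed by guid, and sendForProcessing is a placeholder for whatever puts the message on a queue.

```csharp
using System;
using System.Collections.Concurrent;

public class PollingResponseStore
{
    // Stand-in for the database table of responses indexed by guid.
    private readonly ConcurrentDictionary<Guid, string> responses =
        new ConcurrentDictionary<Guid, string>();

    // Called by the first web method: generates the token, hands the request
    // off for processing, and returns the token the browser will poll with.
    public Guid Submit(string request, Action<Guid, string> sendForProcessing)
    {
        Guid token = Guid.NewGuid();
        sendForProcessing(token, request);
        return token;
    }

    // Called when processing of the message is complete.
    public void Complete(Guid token, string response)
    {
        responses[token] = response;
    }

    // Called by the polling web method: null means "not ready yet, retry later".
    public string TryGetResponse(Guid token)
    {
        string response;
        return responses.TryGetValue(token, out response) ? response : null;
    }
}
```

The browser side is then just a loop: call the polling method with the token, and if the result is null, wait a bit and try again.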

One of the problems with this solution is that polling uses up server resources – both on the web server and our DB; threads, memory, DB connections. A better solution would decrease the resource cost of the polling. Let’s use the fundamental building blocks of the web to our advantage – HTTP GET and resources:

REST-full Polling

Instead of using a guid to represent the id of the response, let’s consider the REST principle of “everything’s a resource”. That would mean that the response itself would be a resource. And since every resource has a URI, we might as well use that URI in lieu of the guid. So, instead of our web service returning a guid, let’s return a URI – something like:

http://www.acme.com/responses/88ec5359-a5d8-4491-a570-3bfe469f3a64.xml

As you can see, the guid is still there. So, what’s different?

What’s different is that instead of having the processing code write the response to the database, it writes it to a resource. This can be done by writing some XML to a file on the SAN in the case of a webfarm. Also, the browser wouldn’t need to call a web service to get the response, it would just do an HTTP GET on the URI. If it gets an HTTP 404, it would sleep and retry as before. The reason that the SAN is needed is that, as the browser polls, it may have its requests arrive at various web servers so the response needs to be accessible from any one of them.

Just as an aside, it would be better to free the processing node as quickly as possible and have something else write the response to the SAN. That would be done simply by sending a message from the processing node to a different node whose only job is to write responses to disk.

The reason that the URI makes a difference is that serving “static” resources is something that web servers do extremely efficiently without requiring any managed resources (like ASP.NET threads). That’s a big deal.
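A minimal sketch of the "write the response as a resource" step follows. ResponseResourceWriter is a hypothetical helper; the root directory would be the shared location that every web server serves static files from, and the base URI matches the example above. Writing to a temporary name and then renaming is my own addition, intended to make it unlikely that a poll sees a half-written file.

```csharp
using System;
using System.IO;

public class ResponseResourceWriter
{
    private readonly string rootDirectory; // e.g. a share on the SAN served by every web server
    private readonly string baseUri;       // e.g. "http://www.acme.com/responses/"

    public ResponseResourceWriter(string rootDirectory, string baseUri)
    {
        this.rootDirectory = rootDirectory;
        this.baseUri = baseUri;
    }

    // The URI handed back to the browser in place of the bare guid.
    public string UriFor(Guid token)
    {
        return baseUri + token.ToString("D") + ".xml";
    }

    // Writes under a temporary name first and then renames to the final name.
    // Until the rename happens, a GET on the URI finds no file (HTTP 404).
    public void Write(Guid token, string xml)
    {
        string finalPath = Path.Combine(rootDirectory, token.ToString("D") + ".xml");
        string tempPath = finalPath + ".tmp";
        File.WriteAllText(tempPath, xml);
        File.Move(tempPath, finalPath);
    }
}
```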

We’re still using HTTP connections for the polling but that’s something whose effect can be mitigated to a certain degree.

Timed REST-full Polling

Since various requests can take varying amounts of time to process, it’s difficult to know at what rate the browser should poll. So, why don’t we have the web service tell it? As a part of the response to the original web service call, instead of just returning a URI, we could also return the polling interval – 1 second, 5 seconds, whatever is appropriate for the type of request. This value could easily be configurable [RequestType, PollingInterval].

An even more advanced solution would allow you to change these values dynamically. The advantage that would be gained would be that your operations team could better manage the load on your servers. When a large number of users are hitting your system, you could decrease the rate at which your servers would be polled, thus leaving more HTTP connections for other users.
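One way to picture the [RequestType, PollingInterval] configuration, including changing it while the system is running, is the sketch below. PollingIntervals and PollingInstruction are hypothetical types; in a real system the values would presumably come from configuration or an operations tool rather than direct method calls.

```csharp
using System;
using System.Collections.Concurrent;

// Polling interval per request type, changeable at runtime.
public class PollingIntervals
{
    private readonly ConcurrentDictionary<string, TimeSpan> byRequestType =
        new ConcurrentDictionary<string, TimeSpan>();
    private readonly TimeSpan fallback;

    public PollingIntervals(TimeSpan fallback) { this.fallback = fallback; }

    // Operations can call this to slow down or speed up polling for one request type.
    public void Set(string requestType, TimeSpan interval)
    {
        byRequestType[requestType] = interval;
    }

    // Returns the configured interval, or the fallback if none was set.
    public TimeSpan For(string requestType)
    {
        TimeSpan interval;
        return byRequestType.TryGetValue(requestType, out interval) ? interval : fallback;
    }
}

// What the original web service call could return instead of just a URI.
public class PollingInstruction
{
    public string ResponseUri { get; set; }
    public int PollAfterSeconds { get; set; }
}
```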

Scaling and Adaptive Polling

You’d probably also want to scale out the number of processing nodes behind your queue. The nice thing is that you could change the polling interval as you scale the various processing nodes per request type providing better responsiveness for the more critical requests. Once we add virtualization, things get really fun:

We had separate queues per request type, so that we could easily see the load we were under for each type of request. That way, we could scale out the processing nodes per request type as well as change the polling interval. By virtualizing our processing nodes, and writing scripts to monitor queue sizes, we had those scripts automatically provisioning (and de-provisioning) nodes as well as changing the polling interval of the browsers.

This had the enormous benefit of the system automatically shifting resources to provide the appropriate relative allocation for the current load as its macroscopic make-up changed.
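The decision logic inside such monitoring scripts can be reduced to two small functions of the queue length. The policies below are purely illustrative assumptions (the thresholds, the proportional back-off, and the AdaptiveScaling name are not from the actual system); the actual provisioning of virtual machines is outside the sketch.

```csharp
using System;

public static class AdaptiveScaling
{
    // Hypothetical policy: one node per messagesPerNode queued messages,
    // clamped to the range [minNodes, maxNodes].
    public static int DesiredNodes(int queueLength, int messagesPerNode, int minNodes, int maxNodes)
    {
        int needed = (queueLength + messagesPerNode - 1) / messagesPerNode; // ceiling division
        return Math.Max(minNodes, Math.Min(maxNodes, needed));
    }

    // Hypothetical policy: while the nodes can absorb the backlog, keep the base
    // interval; beyond that, stretch the interval in proportion to the overload.
    public static TimeSpan PollingInterval(int queueLength, int messagesPerNode, int maxNodes, TimeSpan baseInterval)
    {
        int capacity = messagesPerNode * maxNodes;
        if (queueLength <= capacity) return baseInterval;

        double factor = (double)queueLength / capacity;
        return TimeSpan.FromSeconds(baseInterval.TotalSeconds * factor);
    }
}
```

With one queue per request type, a script would evaluate these per queue, which is what lets the more critical request types get more nodes and shorter polling intervals than the rest.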

Summary

Will was well-pleased with the solution which, although more complicated than what he had originally tried, was flexible enough to meet his needs. As opposed to pure server-based solutions, here we make more use of the browser (writing our own JavaScript) instead of putting our faith in some Ajax-y library. That’s not to say that you couldn’t wrap this up into a library – in essence, it is a kind of messaging transport for browser to server communication allowing duplex conversations.

In fact, what could be done is to return multiple responses to the browser over a long period of time. In the response that comes back to the browser could be an additional URI where the next response will be. This can be used for reporting the status of a long running process, paging results, and in many other scenarios.

And, one parting thought, could this not be used for all browser to web service communication?

Durable Messaging Dilemmas

udidahan — Thu, 17 Jul 2008 22:18:47 +0000

I’ve received some great feedback on my MSDN article and some really great questions that I think more people are wondering about, so I think I’ll try to do a post per question and see how that goes.

Libor asks:

“Would you recommend using durable messaging for systems where there are similar requirements with respect to data reliability as you had – ie. not losing any messages? If so, then why didn’t the final version of your solution use it? If not, can you explain why?”

The answer is, as always, it depends, but here’s on what it depends:

When designing a system, we need to take a good, hard look at how we manage state, and what properties that state has. In a system of reasonable size we can expect various families of state with respect to their business value, data volatility, and fault-tolerance window. Each family needs to be treated differently. While durable messaging may be suitable for one, it may be overkill or underkill for another.

So, here’s what we’re going to be looking at:

Business Value
Data Volatility
Fault-Tolerance Window

Business Value

When talking about business value, I want to talk about what “not losing any messages” means. The question is under what conditions will the messages not be lost, or rather, what are the threshold conditions where messages may start getting lost. If all our datacenters are nuked, we will lose data. It’s likely the business is OK with that (as much as can be expected under those circumstances). If a single server goes down, it’s likely the business would not be OK with losing messages containing financial data. However if a message requesting the health of a server were to get lost under those same conditions, that would probably be alright. In other words, what does that message represent in business terms?

Data Volatility

Data volatility also has an impact. Let’s say that we’re building a financial trading system. The time between receiving an event (message) saying that the price of a certain financial instrument has changed and sending the message requesting to buy that security is critical. Let’s say that has to be done in under 10ms. Now, some failure has occurred preventing our message from reaching its destination for 20ms. What should we do with that message? Should we keep it around, making sure it doesn’t get lost? Not in this domain. On the contrary, that message should be thrown away as its “business lifetime” has been exceeded. Furthermore, even during that original period of 10ms, the use of durable messaging may make it close to impossible to maintain our response times.
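The idea of a “business lifetime” can be expressed as a simple check made before a message is processed or retried. TimedMessage and its members are hypothetical names for the sketch, and the timestamps are assumed to come from clocks that are synchronized closely enough for the lifetime in question.

```csharp
using System;

public class TimedMessage
{
    public DateTime SentAtUtc { get; set; }

    // How long the message is worth anything to the business.
    public TimeSpan BusinessLifetime { get; set; }

    public string Body { get; set; }

    // True once the message is older than its business lifetime and should
    // be discarded rather than delivered or retried.
    public bool IsExpired(DateTime nowUtc)
    {
        return nowUtc - SentAtUtc > BusinessLifetime;
    }
}
```

In the trading example, a buy request with a 10ms lifetime that is still undelivered after 20ms would report IsExpired as true and be dropped instead of being kept around.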

Fault-Tolerance Window

These two topics feed into the third and more architectural one – fault-tolerance window: for what period of time do we require fault tolerance, and with respect to how many (and what kind of) faults? This will lead us into an analysis of how many machines we need to copy a message to before we release the calling thread. We’d also look at which datacenters those machines reside in. This will also impact (or be impacted by) the kinds of links we have to these datacenters if we want to maintain response times. These numbers will need to change when the system identifies a disaster – degrading itself to a lower level of fault-tolerance after a hurricane knocks out a datacenter, and returning to normal once it comes back up.

Re-Evaluating Durable Messaging

Durable messaging may be used at various points in each part of the solution, but we need to look at message size, the rate those messages are being written to disk, how fast the disk is, how much available disk we have (so we don’t make things worse in the case of degraded service), etc. Companies like Amazon also take into account disk failure rates, replacement rates (disks aren’t replaced immediately you know), and many other factors when making these decisions.

Summary

Our job as architects when designing the system is to find that cost-benefit balance for the various parts of the system according to these very applicative parameters. No, it’s not easy. No, cloud computing will not magically solve all of this for us. But, we are getting more technical tools to work with, operations staff is getting better at working with us in the design phase, and our thought processes are becoming more rigorous in dealing with the scary conditions of the real world.

To your question, Libor, as to why we didn’t eventually use durable messaging in our solution, the answer is that we solved the overall state management problem by setting up an applicative protocol with our partners which was resilient in the face of faults by using idempotent messages that could be resent as many times as necessary. You can read more about it here. This solution isn’t viable for other kinds of interactions but was just what we needed to get the job done.
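To make the idea of idempotent, resendable messages concrete, here is a minimal sketch. It is not the protocol we used with our partners; `OrderBook`, `send_until_acked`, and the “set quantity” message are hypothetical stand-ins.

```python
class OrderBook:
    """Receiver whose handler is idempotent: resending a message is harmless."""

    def __init__(self) -> None:
        self._quantities: dict[str, int] = {}

    def handle_set_quantity(self, order_id: str, quantity: int) -> None:
        # The message states the desired end result rather than a delta,
        # so applying it once or five times leaves the same state.
        self._quantities[order_id] = quantity

    def quantity(self, order_id: str) -> int:
        return self._quantities[order_id]


def send_until_acked(deliver, message, max_attempts: int = 5) -> int:
    """Resend the same message until the receiver acknowledges it.

    `deliver` is any callable returning True when an acknowledgement came back.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        if deliver(message):
            return attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")
```

Because the handler is idempotent, the sender never has to work out whether a lost acknowledgement means the message was processed or not; it just sends again.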

Hope that helps.

Make WCF and WF as Scalable and Robust as NServiceBus

udidahan — Mon, 30 Jun 2008 14:47:08 +0000

This topic is getting more play as more people are using WCF and WF in real-world scenarios, so I thought I’d pull the things that I’ve been watching in this space together:

Reliability

Locking in SqlWorkflowPersistenceService (via Ron Jacobs) where, if you want predictable persistence (MS: ‘none of our customers asked for this to be easy’), you need to use a custom activity (which Ron was kind enough to supply).

“Given what I learned today I’d have to say that I’d be very careful about using workflows with an optimistic locking. Detecting these types of situations is not that simple.”

Let’s think about that. If we’re doing pessimistic locking, we run into this problem: if a host restarts (as the result of a critical Windows patch or some other unexpected occurrence), the workflow can’t be handled by any other host in the meantime (you didn’t care so much about your SLA, did you?).

Luckily, someone’s come up with a hack that works around this robustness problem in Scalable Workflow Persistence and Ownership.

“So this code will attempt to load workflow instances with expired locks every second. Is it a hack? Yes. But without one of two things in the SqlWorkflowPersistenceService its the sort of code you have to write to pick up unlocked workflow instances robustly.”

This will seriously churn the table used to store your workflows, decreasing performance of workflows that haven’t timed out. Oh well.
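For readers who want to see the shape of that workaround, here is an in-memory sketch of lock ownership with expiry. It is an assumption-laden stand-in, not the SqlWorkflowPersistenceService API: `WorkflowRow`, `try_take_ownership`, and `claim_expired` are hypothetical names, and a list plays the role of the persistence table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowRow:
    instance_id: str
    owner: Optional[str] = None   # host currently holding the lock
    locked_until: float = 0.0     # time at which the lock expires


def try_take_ownership(row: WorkflowRow, host: str, now: float, lock_seconds: float) -> bool:
    """Pessimistic lock with expiry: another host's live lock blocks us."""
    if row.owner is not None and row.owner != host and row.locked_until > now:
        return False
    row.owner = host
    row.locked_until = now + lock_seconds
    return True


def claim_expired(rows: list[WorkflowRow], host: str, now: float, lock_seconds: float) -> list[str]:
    """One polling pass: visit every row and claim those we are allowed to own."""
    return [
        row.instance_id
        for row in rows
        if try_take_ownership(row, host, now, lock_seconds)
    ]
```

Each polling pass touches every row, which is where the churn on the workflow table comes from.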

Testability

Implementing WCF Services without Referencing WCF (via Mark Seemann):

“More than a year ago, I wrote my first post on unit testing WCF services. One of my points back then was that you have to be careful that the service implementation doesn’t use any of the services provided by the WCF runtime environment (if you want to keep the service testable). As soon as you invoke something like OperationContext.Current, your code is not going to work in a unit testing scenario, but only when hosted by WCF.”

After pointing out some of the more basic difficulties in testability a straightforward WCF implementation brings, Mark turns the heat up in his follow-up post, Modifying Behavior of WCF-Free Service Implementations:

“Perhaps you need to control the service’s ConcurrencyMode, or perhaps you need to set UseSynchronizationContext. These options are typically controlled by the ServiceBehaviorAttribute. You may also want to provide an IInstanceProvider via a custom attribute that implements IContractBehavior. However, you can’t set these attributes on the service implementation itself, since it mustn’t have a reference to System.ServiceModel.”

Wow – all the things required to make a WCF service scalable and thread-safe make it difficult to test. In the end, we’re beginning to see how many hoops we have to go through in order to get separation of concerns, but until we can take all this and get it out of our application code, it’s an untenable solution. I hope Mark will continue with this series, if only so I can take the framework that might grow out of it and use it as a generic WCF transport for NServiceBus.

Comparison

After the Neuron-NServiceBus comparison that Sam and I had, we talked some more. After going through some of the rationale and thinking, Sam even put nServiceBus into his WCF-Neuron comparison talk. Sam had this to say about nServiceBus:

“The bottom line is: I like what I see. Although it’s a framework, not an ESB product like Neuron, it’s a powerful framework that takes the right approach on SOA and enforces a paradigm of reliable one-way, *non-blocking* calls. That is the point of the talk tonight overall; we need to get away from the stack world of synchronous RPC calls to true asynchronous non-blocking message based SOA systems.”

The main concern I have with a WCF+WF based solution is that developers need to know a lot in order to make it testable, scalable, and robust. In nServiceBus, that’s baked into the design. It would be extremely difficult for a developer writing application logic to interfere with when persistence needs to happen, or the concurrency strategy of long-running workflows. The fact that message handlers in the service layer don’t need concurrency modes, instance providers, or any of that junk makes them testable by default.
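To show what “testable by default” can look like, here is a small sketch in Python rather than in nServiceBus itself. The handler depends only on something with a `publish` method, so a test can hand it a recording fake. `PlaceOrder`, `OrderAccepted`, `FakeBus`, and `PlaceOrderHandler` are hypothetical names for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceOrder:
    order_id: str
    amount: float


@dataclass(frozen=True)
class OrderAccepted:
    order_id: str


class FakeBus:
    """Test double that records what the handler publishes."""

    def __init__(self) -> None:
        self.published: list[object] = []

    def publish(self, message: object) -> None:
        self.published.append(message)


class PlaceOrderHandler:
    """Plain class: no hosting runtime, concurrency mode, or instance provider."""

    def __init__(self, bus) -> None:
        self._bus = bus

    def handle(self, message: PlaceOrder) -> None:
        # Business rule for the example: only positive amounts are accepted.
        if message.amount > 0:
            self._bus.publish(OrderAccepted(message.order_id))
```

A unit test constructs the handler with a `FakeBus`, calls `handle`, and inspects `published`; nothing else needs to be running.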

Object Relational Mapping Sucks!

udidahan — Wed, 25 Jun 2008 11:32:06 +0000

For reporting, that is.

And doesn’t handle concurrency!

Unless you don’t expose setters.

I guess it depends, doesn’t it?

Well, that was Ted’s assertion in his recent Pragmatic Architecture column on data access.

But, “it depends” doesn’t get the system built, does it?

So, here are some rules for using o/r mapping that will get you 99% of the way there.

Yes, you heard me.

Rules.

They do not depend.

If you’re doing something significantly bigger than enterprise-scale development, and you are already doing this, and it isn’t enough, give me a call. Here we go.

No reporting.

I mean it. Don’t report off of live data.
This isn’t just an o/r mapping thing.
Users can tolerate some, if not quite a lot of, latency.

And it’s not like objects are even used. It’s just rolled up data. Not a single behaviour for miles.

Don’t expose setters

You want multiple users sharing and collaborating on data, right? Then don’t force them to either overwrite each other’s data, or throw away their own. There is one simple way to avoid that: Get an object, call a method. Once the object has the most up to date data, pass all the client data in via a method call. The object will decide if it’s valid, from a business perspective as well, and then update the appropriate fields.

Now your DBAs can vertically partition tables accordingly, and improve throughput. After that, you can increase the isolation level, to improve safety, without hurting throughput.

This will also keep your logic encapsulated, bringing you closer to a true Domain Model.

If your O/R mapping tool requires you to have setters on your domain classes, hide those from your service layer behind an interface.

Grids are like reports.

No o/r mapping required there either. While you probably won’t be showing grids of yesterday’s data to users in an interactive environment, it’s still just data – no behaviour.

However, users should NOT update data in those grids. This gets back to rule 2. Have users select a specific task they want to perform, pop open a window, and have them do it there. Change customer address. Discount order. You get the picture. That way you’ll know what method to call on those objects you designed in rule 2.
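A tiny sketch of what “get an object, call a method” can look like for the “change customer address” task. This is illustrative Python, not code from any O/R mapper; the `Customer` class and its validation rule are assumptions made for the example.

```python
class Customer:
    """Domain object with read-only state and task-named methods."""

    def __init__(self, name: str, address: str) -> None:
        self._name = name
        self._address = address

    @property
    def address(self) -> str:
        # Read access only: there is deliberately no setter.
        return self._address

    def change_address(self, new_address: str) -> bool:
        """The object decides whether the change is valid, then updates itself."""
        cleaned = new_address.strip()
        if not cleaned:
            return False
        self._address = cleaned
        return True
```

Assigning to `customer.address` directly raises an `AttributeError`, so the only way to change the address is through the method that carries the business rule.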

Before wrapping up, one small thing.

You can use an O/R mapping tool to do reporting, just, for the love of Bill, don’t use the same classes you designed for your OLTP domain model. But, just because you can, doesn’t necessarily mean you should. ~~Datasets~~ datatables are probably just as viable a solution.

Sagas Solve Stupid Transaction Timeouts

udidahan — Mon, 23 Jun 2008 07:09:31 +0000

It turns out that there was a subtle, yet dangerous problem in the use of System.Transactions – a transaction could timeout, rollback, and the connection bound to that transaction could still change data in the database.

Think about that a second.

Scary, isn’t it?

At TechEd Israel I had a discussion with Manu on this very issue, just under a different hat:

What’s the difference between a short-running workflow and a long-running one?

Manu suggested that we look at the actual time that things ran to differentiate between them. I asserted that if any external communication was involved in some part of state-management logic, that logic should automatically be treated as long-running.

Manu’s reasoning was that the complexity involved in writing long-running workflows was not justified for things that ran quickly, even if there was communication involved. Many developers don’t think twice about synchronously calling some web services in the middle of their database transaction logic. In the many Microsoft presentations I’ve been at on WF, not once has it been mentioned that state machines should be used when external communication is involved.

The problem that I have with this guidance is how do you know how quickly a remote call will return?

Do you just run it all locally on your machine, measure, and if it doesn’t take more than a second or so, then you’re OK?

The fact of the matter is that we can never know what the response time of a remote call will be. Maybe the remote machine is down. Maybe the remote process is down. Maybe someone changed the firewall settings and now we’re doing 10KB/s instead of 10MB/s. Maybe the local service is down and we’re communicating with the backup on the other side of the Pacific Ocean.

But the thing is, Manu’s right.

Writing long-running workflows (with WF) is more complex than is justified. My guess is that this complexity crept in because WF wasn’t specifically designed for long-running workflows only.

Sagas in nServiceBus were specifically designed for long-running workflows only.

Maybe that’s what kept them simple.

Since all external communication is done via one-way, non-blocking messaging only, each step of a saga runs as quickly as if no communication were done at all. This keeps the time the transaction in charge of handling a message is open as short as possible. That, in turn, leads to the database being able to support more concurrent users.
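The following Python sketch shows the general shape of that idea; it is not the nServiceBus saga API. `Outbox`, `OrderSaga`, and the message names are hypothetical. Each handler updates a little state, sends a one-way message, and returns immediately without waiting for a reply.

```python
class Outbox:
    """Stand-in for one-way, non-blocking sends: it only records messages."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, message_type: str, order_id: str) -> None:
        self.sent.append((message_type, order_id))


class OrderSaga:
    """Long-running process expressed as short, independent steps."""

    def __init__(self, bus: Outbox) -> None:
        self._bus = bus
        self.state = "new"

    def handle_order_placed(self, order_id: str) -> None:
        # Step 1: remember where we are, ask for payment, and return.
        self.state = "awaiting_payment"
        self._bus.send("AuthorizePayment", order_id)

    def handle_payment_authorized(self, order_id: str) -> None:
        # Step 2 runs whenever the reply arrives, in its own short transaction.
        if self.state != "awaiting_payment":
            return  # ignore replies that arrive out of order or twice
        self.state = "completed"
        self._bus.send("ShipOrder", order_id)
```

No step ever blocks on a remote call, so the time each step holds a transaction open does not depend on how long the remote side takes to answer.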

In short, sagas are both more scalable and more robust.

No need to worry about garbaging-up your database.

[Podcast] Highly Scalable Web Architectures

udidahan — Thu, 19 Jun 2008 20:42:34 +0000

For those people who couldn’t come to TechEd USA and didn’t see my talks on how to build highly scalable web architectures, you’re in luck – Craig, the man behind the Polymorphic Podcast, sat down with me and we chatted about the problems, common solutions, and effective tactics in this space. For those of you who were at TechEd and still didn’t come to my talk – what were you thinking?!

Check it out.

Some of this stuff is a bit counter-intuitive (and not readily supported by the tools available in Visual Studio) so please, do feel free to ask questions (in the comments below).

NServiceBus Performance

udidahan — Wed, 21 May 2008 07:08:05 +0000

I’ve gotten this question several times already but now companies are beginning to look for performance comparisons in making decisions around the use of nServiceBus. It’s often compared to straight WCF, BizTalk, and now Neuron ESB. In Sam’s recent post he points to a case study of Neuron doing 28 million messages an hour. That’s far more than I’ve ever heard quoted for BizTalk.

Disclaimer

Before giving some numbers, please keep in mind that high performance of system infrastructure does not necessarily by itself mean that the system above it is running that fast. For instance, you may have server heartbeats running really quickly but the time it takes to save a purchase order borders on a minute. So, please, take all benchmarks with a grain of salt, or two, or a whole shaker-full.

While I’m not at liberty to say on which specific domain/company these numbers were measured, I can say that we had the full gamut of “stateless services”, stateful services (sagas), number crunching, large data sets, many users, complex visualization, etc. Also, this wasn’t the largest installation of nServiceBus that I’m aware of, but it’s the one I have the most specific numbers for.

Setup

OK, so using the default nServiceBus distribution using MSMQ, on servers where the queue files themselves were on separate SCSI RAID disks, we were pumping around 1000 durable, transactionally processed messages per second, per server. That means that similar to the Neuron case, no messages would be lost in the case of a single fault per server per window (time to replace a failed disk set at 3 hours from failure, through detection, to replacement per site – but that’s more an operational staffing concern, not the technology itself).

So, that’s 3.6 million messages per hour per server, at full load. We had a total of 98 servers doing these kinds of processing, not including web servers, databases, etc. Keep in mind that web servers would be communicating with other servers using nServiceBus, but that would maybe be an unfair comparison to the Neuron numbers.

Server Breakdown

Anyway, the 48 number crunching servers (blade centers) we had were at full load, so we were pumping more than 170 million messages an hour there. Keep in mind that those servers had a really fast backbone so weren’t held up by IO. Your environment may be different.

Another 30 (regular pizza boxes) were doing our sagas. Saga state was stored in a distributed in-memory “cache”, so once again IO wasn’t an issue for processing those messages. We were at about 70% utilization there, coming to just over 100 million messages an hour.

The last 20 were clustered boxes (fairly expensive) that handled the various nServiceBus distributor and timeout manager processes. They were at full load, since they handled control messages for all the servers as well as dynamically routing the load. However, on those boxes we used much higher performance disks for the messages, since they had to feed everything else. Each was capable of doing, on average, around 5000 messages a second. That adds up to 360 million messages an hour.
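The hourly figures follow directly from the per-second rates. This small Python sketch just redoes that arithmetic; the function name is hypothetical, and the saga tier is omitted because the post does not state its per-server rate.

```python
SECONDS_PER_HOUR = 3600


def messages_per_hour(servers: int, per_server_per_second: int) -> int:
    """Aggregate hourly throughput for a tier running at full load."""
    return servers * per_server_per_second * SECONDS_PER_HOUR


per_server = messages_per_hour(1, 1000)          # one default MSMQ server
number_crunching = messages_per_hour(48, 1000)   # 48 blades at full load
distributors = messages_per_hour(20, 5000)       # 20 clustered boxes, faster disks
```

That gives 3.6 million an hour for a single server, 172.8 million for the number crunching tier, and 360 million for the distributor and timeout manager tier.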

Unnecessary Durability

Later, we moved a bunch of messages that didn’t need all that durability and transactionality off the disks, pushing the total throughput over 1 billion messages an hour. That was about 100 million per hour durable, 900 million per hour non-durable. You can guess that we were left with plenty of IO to spare at that point while we weren’t yet pushing the limit of our memory.

One thing that’s important to understand is the size of the messages that didn’t require durability was less than 1MB, with most weighing in under 10KB. Also, since most of those messages were published, less state management was required around them, enabling us to further improve performance.

Summary

NServiceBus didn’t give us all that by itself. It was the result of skilled architects, developers, and operations staff working together for many iterations, deploying, monitoring, re-designing, etc. You need to understand your technology, your hardware, and your specific performance, availability, and fault-tolerance requirements if you want to get anywhere.

There’s no magic.

I didn’t see the number or kinds of servers involved in the Neuron case study so this wasn’t ever really a comparison. Nor are we talking about the same system here.

So, please, don’t base your decisions on arbitrary numbers. Spend some time setting up a scaled down version of your target architecture with all the relevant technologies and measure. Be aware that you want high performance end to end, not just of the messaging part. At times, it makes sense to actively throw away messages (of the non-durable, published kind) to help a server come online faster especially after a restart.

Thus ends the tale of another “benchmark”.

[Video] Messaging and Architecture Discussion at ALT.NET

udidahan — Mon, 28 Apr 2008 21:53:00 +0000

In this video, Greg Young, Martin Fowler, Evan Hoff, Dru Sellers, myself and some others discussed various aspects of event-based systems, how Domain-Driven Design works with them, what role messaging has, and how all these connect to architectural properties like scalability and fault tolerance.

One of the questions that Martin started answering was how teams can start getting into the messaging state-of-mind. Unfortunately, the conversation veered off into what kind of messaging interactions are appropriate leaving the original question unanswered.

I’m hoping to address this topic with some of the information I’m putting up on the nServiceBus site. There’s always Gregor and Bobby’s excellent EIP book that I think is a must for anybody writing distributed systems.

Enjoy.

Scalability Article up on InfoQ

udidahan — Fri, 11 Apr 2008 05:59:38 +0000

I’ve published a new article on performance and scalability on InfoQ:

Spectacular Scalability with Smart Service Contracts

In this article, I attempt to debunk some of the myths around stateless-ness as the key to scalability.

Here’s how it starts:

It was a sunny day in June 2005 and our spirits were high as we watched the new ordering system we’d worked on for the past 2 years go live in our production environment. Our partners began sending us orders and our monitoring system showed us that everything looked good. After an hour or so, our COO sent out an email to our strategic partners letting them know that they should send their orders to the new system. 5 minutes later, one server went down. A minute after that, 2 more went down. Partners started calling in. We knew that we wouldn’t be seeing any of that sun for a while.

The system that was supposed to increase the profitability of orders from strategic partners crumbled. The then seething COO emailed the strategic partners again, this time to ask them to return to the old system. The weird thing was that although we had servers to spare, just a few orders from a strategic customer could bring a server to its knees. The system could scale to large numbers of regular partners, but couldn’t handle even a few strategic partners.

This is the story of what we did wrong, what we did to fix it, and how it all worked out.

Continue reading…

Distributed Architecture on ARCast.TV Rapid Response

udidahan — Mon, 14 Jan 2008 23:45:34 +0000

A while ago, Ron Jacobs and I (virtually) got together and did a couple of “rapid responses” to questions on the MSDN architecture forums, and I just noticed that they’re online. The really great thing is that there are transcripts! For your convenience, I’ve included them here.

By the way, if you’re looking for more Q&A style info, check out the Ask Udi podcast. If you have a pressing question and need a shorter turn around time than the month or so it usually takes me for the podcast, send me an email to OnlineConsultation@UdiDahan.com.

Number 1

Ron: Hey, welcome back to ARCast Rapid Response. This is your host Ron Jacobs and today I’m looking at the MSDN architecture forum where I see this message from “theking2.” Yeah? OK, so “king,” he says, he’s building a distributed architecture that has a number of external systems. These external systems interface through a telnet connection and so they accept commands and return results as ACKS or NACKS.

Typically these systems have limited resources for the number of simultaneous sessions you can open, so, five to fifty depending on the system. What he did to get around this, was, he created some Enterprise Services objects and some pooled objects that set up these connections and then he has some Web services. The Web services are going to receive an incoming message. They’re going to call these pooled COM+ objects and they’re going to make the telnet calls to the external systems. Sounds interesting.

He says, after a year of production it has become apparent that some of the external systems are not performing very well. He says the bulk of the requests, but not all, to the external systems can be done asynchronously. So, he’s opting for a message queue-based solution using pseudosynchronous calls whenever a direct response is needed.

So, the question is, at what layer would message queuing make most sense?

So, should the clients, this Web service that receives the message — should it do a queue? Put a message in the queue and then the COM+ objects would pop off or they have some central Web services that would pop it. So, the central Web services or these Enterprise Service objects? Or maybe just a communication at the top of the telnet. He says this is the first time when he’s using message queuing.

On the line with me I have Udi Dahan, the Software Simplist from Israel.

Udi, this is a very interesting application and my first gut reaction is, does it really matter where you put the queuing?

Udi Dahan: Well, actually I took a look at it as well and I’d have to say that it does because the problem that he’s trying to solve isn’t that clear. We know that there is some sort of performance problem but we’re not quite sure where it is. We know that there are long and varying latencies in the responses but we’re not really quite sure why.

While we know that their external system is a bit slow, our choice of where to put the queue will probably have an impact, obviously on the development model of the clients and the Web services as well as how those external systems would work. So, I’d have to say that choosing the correct place to put the queue is important.

Ron: Well, let me interject something here because what you said just made me think. Now, if the problem is that these external systems are slow and limited number of connections, the first question we ought to ask is, does queuing help this situation at all?

Udi: Well, that’s probably a good first step. I mean every single time someone comes with a solution and then says, “OK, what’s the problem,” it’s always a good thing to check that solution first.

It looks like the problem that he has here has to deal with or the reason that he wants to use a queue is to do some kind of load leveling. He’s getting too many requests or at too high a rate from his clients and external Web services and external web applications more than his back end systems can use. So, using a queue as a load leveling mechanism is definitely the right way to go. So, from that perspective I think that putting a queue somewhere in there is a good idea.

Ron: OK. So then if you put a queue, it seems to me that it’s not going to make that much difference which layer you put the queue, would it?

Udi: Well, it might for the main reason that you really have to look at where his bottleneck is and that’s his back end systems. The bottleneck also has to do with the number of connections that can be opened and the number of sessions that can be opened. The place that I’d be looking at doing that is probably between those pooled COM+ objects and his central Web services for the main reason that that really gives a nice encapsulation in terms of the Web services towards both his organization’s internal services if they are other Web services, web applications or clients and everybody else that’s going on out there while keeping that abstraction out of the way.

So, the choice of using pooled COM objects is one of the ways he does the load leveling now. One of the problems he has is that it doesn’t seem to be doing that much for him because the switches and knobs that are available in COM+ in order to do that load leveling aren’t that great. What I’d be looking at in his situation is to put a queue in there but on the back side of that queue, not talking directly to the external system but doing something with WCF.

WCF has an incredible amount of switches and knobs in order to do the load throttling and the number of threads that are open. He could also do that on a large number of URIs in order to sort of split up the load from that perspective allowing him to cache results quite a bit better. So, that’s where I’d be looking at too. Just throw away those COM+ objects, put WCF in there, use the MSMQ binding and start configuring things from there.

Ron: There’s a lot of stuff in the message, but I think his core concern is performance. He mentions pseudosynchronous calls. I think by that he means, a message comes in to the web service, he’s going to drop something on the queue and then hold that message response until he gets a response back from a queue. So, it’s sort of synchronous but sort of not synchronous. So, in effect he’s kind of waiting on a queue instead of waiting on the pooled object to make this outbound telnet call.

I could agree if you said, “Well, look, our big problem is that we keep getting time outs because when we go to get a COM+ object from the pool, COM+ waits for a while and then it says, ‘Hey, there’s no object available’ and it returns an error,” then the queue is definitely going to help that problem. But in terms of the sheer throughput or performance of the system, this is not going to help at all. It’s going to still be the same performance.

Now if you said, “Oh look, we can do some of this work kind of at a later point in time,” well, queuing does allow you to time shift the work. Right? So, if you said, “Look we can rethink this solution.” So you get a message in, we stuff something in a queue that we’ll deal with later, and then very quickly return a response like some kind of a number like, hey “your transaction number blah, blah, blah, will be processed later, it’s queued for processing,” whatever.

I mean that introduces a lot of complexity in the system but it clearly would provide better response at the Web service layer. What do you think?

Udi: Well I think that at the most basic level, his throughput is dictated by his back end systems. From what he seems to be describing, every single request that is going through there, has to hit that back end system. If he has a limited number of back end systems that are supporting a limited number of connections, that’s going to limit his throughput no matter what technology he puts in front of that. So that’s at the core level. You just can’t get away from that.

The one thing that I would agree with you in your description there is the choice of using those COM+ objects. I mean COM+ was a great technology when it came out. The problem occurs, of course, when we start getting into larger and larger delays around the response time and we start getting all sorts of time out exceptions and things like that. So in that respect, I definitely say you know, take a step back from there.

But in terms of everything that he has around there, the queue isn’t going to make the back end system run any faster. What it will do is definitely complicate his system because he’s taking something that used to be synchronous and making it asynchronous. Writing Web services in order to handle that, I mean just adding a bunch of threads in order to listen to queues is not going to make things any simpler.

However, what it might do is to improve the resource usage of those Web services, OK? So instead of having those Web services have a bunch of threads open, waiting for the response coming back from those COM+ pooled objects, those threads could be relinquished and really just be triggered back up when a response comes back from the queue.

So I don’t see an improvement in the kind of solution that MSMQ or queuing would put in there in terms of the latency — how long it would take for a response to get back. However, I do see an improvement in terms of the resource usage of all the other players in the system.

Ron: I would agree with that. I would just say though that if you make the Web server that is hosting these Web services more resource efficient, maybe all you’re going to do is enable it to get more requests in the queue more quickly. Ultimately, this solution I think is going to solve a lot of problems related to time outs and server busy errors and that sort of thing, thread contentions, but not likely to increase overall performance.

But I definitely agree though. I would move this solution forward to WCF. I used to be on the COM+ team. COM+ was rolled into WCF so that it would have similar capabilities for pooling, instancing behavior, transactional support, those sorts of things. I would definitely move that forward into WCF.

OK! So great answer, Udi. Thank you so much for being on this ARCast Rapid Response.
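The load-leveling idea discussed above can be sketched in a few lines of Python. This is only an illustration using the standard library, not MSMQ, COM+, or WCF: `run_load_leveled` and `call_backend` are hypothetical names, and the worker threads stand in for the limited number of telnet sessions the external system allows.

```python
import queue
import threading


def run_load_leveled(requests, max_sessions, call_backend):
    """Drain a queue of requests with at most `max_sessions` concurrent calls.

    Returns the replies keyed by request, plus the peak number of calls
    that were in flight at the same time.
    """
    work = queue.Queue()
    for request in requests:
        work.put(request)

    results = {}
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        while True:
            try:
                request = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do; this session closes
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                reply = call_backend(request)
            finally:
                with lock:
                    active -= 1
            with lock:
                results[request] = reply

    # One worker per allowed session: the queue absorbs any burst beyond that.
    threads = [threading.Thread(target=worker) for _ in range(max_sessions)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, peak
```

However many requests arrive, the back end never sees more than `max_sessions` calls at once. As the discussion points out, this protects the back end and frees up callers, but it does not make the back end itself any faster.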

Number 2

Ron: Hey this is Ron Jacobs back with another ARCast.TV Rapid Response. Today I’m joined by Udi Dahan, the Software Simplist from Israel.

Udi, I’m looking at the MSDN Architecture Forum and here’s a question from “blast.” Blast says he’s looking for where to put business rules. He’s developing a WinForm application. He uses data sets as the data layer, he says. He’s thinking about business rules and where to put them.

He says obviously, the more organized and centralized business rules are, the better. He’s tempted to put the business rules in the UI layer especially with the typed data set. It makes a lot of sense there but not all rules belong on the client. He says some rules belong on the server, perhaps in a trigger.

So he’s asking where do you put your rules? How do you think about this problem, Udi?

Udi: Well, it looks like what he’s doing here is developing a two-tier client that is using WinForms and using datasets and speaking directly to the database. That in essence is part of his problem in that in terms of performance, he’d like to run more rules in the UI layer so that the user won’t be sending garbage to the database.

He also understands that because he’s building a multi-user system, there is a limited capability, in terms of concurrency, of actually having all the rules run correctly in order to make sure that everything is correct. So, his choice of an architecture, working two-tier is the main problem of why he has to fragment his business rules.

If he were to move towards a three-tier solution, that is put an application server between his smart client and the database, it would be a lot easier to put those business rules there. Now, once the business rules are out of the database, because again, we don’t have to deal with the concurrency issues once we have an application server and we’re using transactions there and we don’t have any disconnected problems, then what we can do is use those same DLLs, that same CLR code that runs the business rules, and deploy it client-side and use it there.

So, in terms of deployment, what we’d have is we’d have the same rules, both running client-side and server-side, whereas from a development perspective, we’d have them organized and centralized. That’s the way that I’d go about it.

Ron: Yeah, you know, I think conceptually I agree with you that a multi-tier solution would be a very good idea here. What I would probably think about conceptually, is breaking down rules into things that really ought to happen on the client-side. In particular, rules related to validation of data, so you know that you’ve got good and complete data before you ship it off to the server-side. Oftentimes you have to do that anyway because you have a button that shouldn’t be enabled until the data is valid, or something like that.

Udi: Absolutely.

Ron: Of course, we all know that if you have middle-tier web services, you must do validation both on the client-side and the server-side, because you must ensure that the valid data is received on the web server. So I agree with you that creating an assembly that you deploy on both sides is a good idea.

I would just expand on what you said a little bit and think about maybe on the server-side using a workflow foundation and business rules and workflow as a way to handle a lot of the heavier lifting, server-side validations and business rules that might require maybe sifting through more data or whatever kind of things, but server-side business rules that are more oriented towards business logic, and even if you have very, very data-intensive rules, then maybe some of those might even happen in the database. Don’t you agree?

Udi: Oh, absolutely. Absolutely. That’s something that I think often gets swept under the rug too much. Things like unique constraints and things like that are kinds of business rules. They protect the referential integrity and if we look at the alternatives, sometimes getting 10 million rows out of the database, in order to do some sort of unique email validation upon them, that’s just going to kill your performance.

There are certain things that it just makes sense to do them in the database, it’s just the best way to do it. The hard part, from a development perspective, is maintaining the coherence of your business rules. When you say, “OK, I want a single perspective, what are all the rules in my system?”

Even though we might try to keep it all CLR based, some of the things like unique constraints, like referential integrity, will be in the database. So, what I sometimes suggest to do is to have a separate solution, in terms of your development team, where you put all your business rules.

This includes both the SQL statements for defining your unique constraints and your referential integrity. Also put in that validation logic, your workflow that you’re going to be running server-side. If it’s AJAX controls and regular expressions that you’re going to be doing client-side in order to validate that data, absolutely make sure you have, from a development perspective, one place where you can go where you can see everything, because if you don’t do that [inaudible] can be running, and when things stop working, you won’t know how to debug it.

Ron: All right. Well, excellent answer. Udi, thank you so much.
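As a rough sketch of the “one set of rules, deployed on both sides” idea from this discussion, here it is in Python rather than CLR assemblies. All names (`validate_customer`, `client_can_submit`, `server_save`) are hypothetical, and a plain set stands in for the database’s unique constraint.

```python
import re

# Deliberately simple pattern for the example; not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_customer(customer: dict) -> list[str]:
    """Shared rules: the same function is deployed to client and server."""
    errors = []
    if not customer.get("name", "").strip():
        errors.append("name is required")
    if not EMAIL_PATTERN.match(customer.get("email", "")):
        errors.append("email is not valid")
    return errors


def client_can_submit(customer: dict) -> bool:
    """Client side: enable the Save button only when the shared rules pass."""
    return not validate_customer(customer)


def server_save(customer: dict, existing_emails: set[str]) -> list[str]:
    """Server side: re-run the shared rules, then apply the data-level rule."""
    errors = validate_customer(customer)
    # Uniqueness stays with the data store; the set plays the unique constraint.
    if not errors and customer["email"] in existing_emails:
        errors.append("email already in use")
    if not errors:
        existing_emails.add(customer["email"])
    return errors
```

The server never trusts that the client ran the rules, and the uniqueness check stays where the data lives instead of pulling every row out to check it.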