Reliability – Udi Dahan – The Software Simplist

UI Composition vs. Server-side Orchestration

udidahan — Mon, 09 Jul 2012 06:49:26 +0000

Following on my last post called UI composition techniques for correct service boundaries, one commenter didn’t seem to like the approach I described, saying:

“I’m sorry, but with all due respect I must strongly disagree. You haven’t avoided any orchestration work at all, you’ve just moved it in to client side script!

How are you going to deal with the scenario that one of the service calls fails? Say a failed credit card payment, or no more rooms left? In more javascript??

I would much rather take the less brittle approach of introducing an orchestration service. Like it or not, however trivial it may be, there is a relationship between these services, if one call fails, they both fail. This should be reflected in the architecture, not hidden in javascript. With an orchestration service you also either get transactions for free provided by infrastructure, or alternatively if the underlying service doesnt support this, explicit and unit testable control over recovery.”

Since this is a common point of view, I thought I’d take the time to explain a bit more.

Let’s start at a fairly high level.

On failures

I’ve talked many times in the past about how to handle technical causes for failure like server crashes, database deadlocks, and even deserialization exceptions. Messaging and queuing solutions like NServiceBus can help overcome these issues such that things don’t actually fail – they just take a little longer to succeed.
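To make "they just take a little longer to succeed" a bit more concrete, here is a minimal in-memory sketch of a queue that puts a message back when its handler hits a transient failure. The names (`process_with_retries`, `TransientError`, `max_attempts`) are made up for illustration and this is not how NServiceBus is implemented; it only shows the idea.

```python
from collections import deque


class TransientError(Exception):
    """Stands in for a deadlock, timeout, or similar technical failure."""


def process_with_retries(messages, handler, max_attempts=5):
    """Process messages from a queue, re-queuing any that fail transiently.

    Returns (processed, poisoned): messages that eventually succeeded and
    messages that used up max_attempts (a real system would typically move
    those to an error queue for someone to look at).
    """
    queue = deque((msg, 1) for msg in messages)
    processed, poisoned = [], []
    while queue:
        msg, attempt = queue.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except TransientError:
            if attempt < max_attempts:
                queue.append((msg, attempt + 1))  # try again later
            else:
                poisoned.append(msg)
    return processed, poisoned


# A handler that "deadlocks" the first two times it sees each message.
attempts_seen = {}


def flaky_handler(msg):
    attempts_seen[msg] = attempts_seen.get(msg, 0) + 1
    if attempts_seen[msg] < 3:
        raise TransientError("simulated deadlock")


done, failed = process_with_retries(["order-1", "order-2"], flaky_handler)
```

Both messages end up in `done` after three attempts each; from the sender's point of view nothing failed, it just took longer.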

On the logical side of things, the CQRS patterns I talk about describe an approach where aggressive client-side validation is done to prevent almost all logical causes for failure. The only things that can’t be mitigated client-side are race conditions resulting from actions taken by other users at the same time.

In short, it really is uncommon for things to fail when being processed server-side.

Back to the specific example

The concerns raised in the comment specifically talked about a failed credit card payment or no rooms left in the hotel, so let’s start with the credit card thing:

In my last post I talked about collecting guest and credit card information from the user as a part of the “checkout” process when making a reservation for a hotel room. Just to be clear – there is a final “confirm your reservation” step that happens after all information has been collected.

What this means is that we aren’t actually charging the customer’s card when we collect that data, therefore there is no real issue with a failed credit card payment that needs to be handled by the client-side javascript. When the customer confirms their reservation, yes, there might be a failure when charging the card though there are only some specific types of rates for which the hotel charges your card when you make a reservation.

In general, failed credit card payments are handled pretty much the same way for all ecommerce – an email is sent to the customer asking for an alternative form of payment, also saying that their purchase won’t be processed until payment is made.

In any case, it is only after the reservation is placed that the responsible service would publish an event about that. The service which collected the credit card information would be subscribed to that event and initiate the charge of the card when that event arrives (or not, depending on the rate rules mentioned).
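A rough sketch of that flow, using a toy in-process bus rather than a real messaging framework. The `Bus` class, the `ReservationPlaced` event name, and the `charge_on_reservation` flag are all assumptions made for this example; the point is only that the payment side reacts to an event instead of being called by an orchestrator.

```python
from collections import defaultdict


class Bus:
    """A toy in-process publish/subscribe bus (illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, event):
        for handler in self._subscribers[event_type]:
            handler(event)


class PaymentService:
    """Owns the card data it collected during checkout."""

    def __init__(self):
        self._cards = {}   # reservation_id -> card token
        self.charges = []  # reservation ids whose card was charged

    def collect_card(self, reservation_id, card_token):
        # No charge happens here; we only store what the user entered.
        self._cards[reservation_id] = card_token

    def on_reservation_placed(self, event):
        # Only some rates require charging at reservation time.
        if event["charge_on_reservation"]:
            self.charges.append(event["reservation_id"])


bus = Bus()
payments = PaymentService()
bus.subscribe("ReservationPlaced", payments.on_reservation_placed)

payments.collect_card("r-1", "tok_abc")
payments.collect_card("r-2", "tok_def")
bus.publish("ReservationPlaced", {"reservation_id": "r-1", "charge_on_reservation": True})
bus.publish("ReservationPlaced", {"reservation_id": "r-2", "charge_on_reservation": False})
```

Only `r-1` is charged. Nothing outside `PaymentService` ever sees the card data, and the reservation side doesn't know a payment service exists.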

With regards to there not being any rooms left, well, first of all, there’s overbooking – hotels accept more reservations than rooms available because they know that customers sometimes need to cancel, and some just don’t show up. Secondly, there is a manual compensation process if more people show up than there are actual rooms to put them in. In some cases, a hotel will bump you up to a higher class of room (assuming there aren’t too many reservations for those), and in others they will call a “partner” hotel nearby and put you up there instead.

In summary

While arguments can be made that yes, these issues have been addressed in this specific example, there may be other domains where it is not possible to do these kinds of “tricks”. Although I do agree with that in theory, I’ve spent the better part of 5 years travelling around the world talking to hundreds of people in quite a few business domains, and every single time I’ve found it possible to apply these principles.

In short, the use of UI composition allows services to collect their own data, making it so anything outside that service doesn’t depend on those data structures which makes both development and versioning much easier. Technical failure conditions can be mitigated at infrastructure levels in most cases and other business logic concerns can be addressed asynchronously with respect to the data collection.

Give it a try.

Logically Distributed, Physically Centralized

udidahan — Sun, 06 May 2012 22:56:45 +0000

When people pull back the covers on something like MSMQ, particularly its private queues (the way NServiceBus uses it), and they see that MSMQ is storing its messages in C:\Windows\System32, well, they’re not particularly happy.

One of the reasons they worry about these types of distributed or federated queue-based solutions has to do with physical failures. The concern is that messages would be lost if there was a hard drive failure.

The preference for centralized message broker type solutions is that we can set it up on a nice RAID infrastructure that will take care of any physical reliability concerns. (Just so that we’re clear here – I’m talking about a single datacenter, possibly connected to a disaster recovery site.)

So, here’s the thing:

Virtualization

You see, in a virtualized production environment, the C drive of a virtual machine is physically in the image file of that VM, which is sitting on a SAN (storage area network).

What that means is that when a message is sent from one processing node to another, the data of that message ends up being written to the SAN, with all of its redundant disks. Even if one of the machines has a critical failure and cannot start up again, all the VMs that were running on it can be started up on a different machine without any message loss.

In fact, most virtualized environments have monitoring and management capabilities built-in so that the VMs will be brought up automatically on another machine. Even if you aren’t using messaging, there are so many other benefits that virtualization brings that you probably should be planning on putting it in, if you haven’t already.

Databases too

In fact, many people do the same thing with databases.
The file partitions on which the database server actually stores its data are on the SAN.

Think about that for a second.

All the data in messages flowing through your queues, and the data in the database, on a SAN. This gives you the ability to do a fully consistent backup of the entire system with SAN snapshots, not to mention ship those to your disaster recovery site.

In closing

Distributed solutions are often misunderstood.
Bad experiences in the past with MSMQ can color perceptions in the present.

The thing is that today’s infrastructure is set up to handle distributed solutions much better.
Developers no longer have to turn to centralized broker or database technologies to get the centralized backup and restore capabilities administrators look for.

If you’ve been avoiding NServiceBus for these reasons, give it a try. Not only will it make your life as a developer easier, combined with this virtualization thing, it will make your administrators’ lives easier too.

NServiceBus in Insurance – Testimonial

udidahan — Fri, 30 Mar 2012 10:26:24 +0000

Getting more companies willing to go on the record about their usage of NServiceBus.

Update: Check out full list of customers.

Here’s a testimonial from a large insurance company whose name always had me wondering…

If P & C Insurance is the leading property and casualty insurance company in the Nordic countries with 3.6 million customers and 6400 employees. NServiceBus was chosen as the ESB for one of our core insurance systems and is currently processing around 100,000 insurance policies a month. We expect that number to double within a year.

The support NServiceBus provides around asynchronous messaging and publish/subscribe communication makes the implementation of Service Oriented Architectures very straightforward. While the performance and scalability of the platform is able to handle the massive loads we’re under, we’re just as pleased with how quickly developers are able to get started with NServiceBus – even those coming into the project later on benefit from the maintainability of NServiceBus-centric code. We already have over 30 developers using it.

Looking forwards, we will be using NServiceBus on more systems at ever larger scales.

Video Online: Who Needs a Service Bus Anyway?

udidahan — Sat, 17 Mar 2012 18:24:46 +0000

My presentation at Oredev on “Who needs a service bus anyway?” is now online for your viewing pleasure. Enjoy.

The Myth Of “Infinite Scalability”

udidahan — Thu, 29 Dec 2011 09:58:52 +0000

Scalability is a topic near and dear to my heart.

Many a client seeks me out for the first time for help in this area.

Usually the request is for an amount substantially smaller than infinity.

It’s usually on the discussion groups and in conference presentations that infinity is brought into it.

The basics

The first issue with scalability is the use of the word as an adjective: scalable.

“Is the system scalable?”

Or the similar verb use: “Does it scale?”

The problem here is the implication that there is a yes/no answer to the question.

Scalability is not boolean.

Linear Scalability

When people talk about scalability, or a system being able to scale, they’re usually referring to a graph that looks something like this:

The red graph indicates a system that does not scale well; the green graph indicates one that does.

What is missing from this diagram are the labels of the axes.

The Y axis is Cost, Expense, or Money.
The X axis is usually the number of users (for internet-type companies).

Ultimately, scalability is a cost-function that will tell us how much it will cost to have the system support a certain number of users.

Linear scalability is when the cost of the next user is the same as the cost of the previous user. This means our system doesn’t have bottlenecks. This is what people usually mean when they say “infinite scalability”.
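A tiny sketch of what "the cost of the next user is the same as the cost of the previous user" means in numbers. Both cost functions and their figures are invented for the example; the only thing to look at is whether the marginal cost stays flat.

```python
def marginal_costs(cost, user_counts):
    """Extra cost of each step up in user_counts under a given cost function."""
    totals = [cost(n) for n in user_counts]
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


def linear_cost(users):
    return 2 * users  # every user costs the same (made-up figure)


def bottlenecked_cost(users):
    return users * users // 100  # each extra user costs more than the last


steps = [1000, 2000, 3000, 4000]
linear = marginal_costs(linear_cost, steps)
bottlenecked = marginal_costs(bottlenecked_cost, steps)
```

Here `linear` is the same for every step of 1000 users, while `bottlenecked` keeps growing, which is the shape of the red graph.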

But there’s more

As many of the internet companies (and their investors) have realized over the years, there’s a difference between the number of users and the number of active users. It’s very easy to scale to a billion users when only 1000 of them are active at any given time.

To be more accurate, what we want is additional X-axes for things like total data managed by the system, number of requests per user, resource utilization per request, propagation speed (how quickly information entered by one user needs to be visible to others), and more.

Scalability is a multi-dimensional cost function, where part of an architect’s job is to figure out which dimensions are significant for the system/business, and what the expectation for growth is across each axis.

Preparing for “infinity”

Be careful not to optimize for only a single dimension – reality is a whole lot more complex.

There are so many other things to deal with as a system scales.

For example, do you really think you’re going to want your configuration entirely centralized? Putting everything in one place means easier management, yes, but it also means a mistake will instantly affect everyone. Is it worth the risk? Maybe instead of centralization, we could do with some automation that will allow a staged rollout of configuration changes with the ability to roll back.

The same goes for rolling out new versions, patches, and upgrades.

But that now means we may have multiple versions of the same system in production at the same time. How will that work? Will they all talk to the same database? How will we version the database then? If not, how will we handle state? Won’t this mean our code will have to be backwards compatible from one version to another? Isn’t that hard? Like, insanely hard?

Please, can we park the whole “infinite scalability” thing?
It’s really not the most important concern – not by a long shot.

Inconsistent data, poor performance, or SOA – pick one

udidahan — Sun, 18 Sep 2011 16:52:17 +0000

One of the things that surprises some developers that I talk to is that you don’t always get consistency even with end-to-end synchronous communication and a single database. This goes beyond things like isolation levels that some developers are aware of and is particularly significant in multi-user collaborative domains.

The problem

Let’s start with an image to describe the scenario:

Image 1. 3 transactions working in parallel on 3 entities

The main issue we have here is that the values transaction 2 gets for A and B are those from T0 – before either transaction 1 or 3 completed. The reason this is an issue is that these old values (usually together with some message data) are used to calculate what the new state of C should be.

Traditional optimistic concurrency techniques won’t detect any problem if we don’t touch A or B in transaction 2.
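Here is a small simulation of that scenario. The `Store` class is a toy stand-in for a database with per-row version numbers, and the values are made up; it is only meant to show why a version check on the row being written doesn't notice that the rows being read have gone stale.

```python
class StaleWrite(Exception):
    pass


class Store:
    """Toy versioned store: each entity has a value and a version number."""

    def __init__(self, **values):
        self.rows = {key: (value, 0) for key, value in values.items()}

    def read(self, key):
        return self.rows[key]  # (value, version)

    def write(self, key, value, expected_version):
        # Classic optimistic concurrency: check only the row being written.
        _, current_version = self.rows[key]
        if current_version != expected_version:
            raise StaleWrite(key)
        self.rows[key] = (value, current_version + 1)


store = Store(A=10, B=20, C=0)

# Transaction 2 reads A, B, and C at T0.
a, _ = store.read("A")
b, _ = store.read("B")
_, c_version = store.read("C")

# Transactions 1 and 3 commit changes to A and B in the meantime.
store.write("A", 100, expected_version=0)
store.write("B", 200, expected_version=0)

# Transaction 2 now writes C based on the old A and B. The version check
# on C passes, because nobody else touched C.
store.write("C", a + b, expected_version=c_version)

c_value, _ = store.read("C")
current_sum = store.read("A")[0] + store.read("B")[0]
```

No exception is raised, yet `c_value` was computed from values of A and B that no longer exist in the database.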

In short, systems today are causing inconsistency.

Some solutions

1. Don’t have transactions which operate on multiple entities (which probably isn’t possible for some of your most important business logic).

2. Turn on multi-version concurrency control – this is called snapshot isolation in MS SQL Server.

Yes, you need to turn it on. It’s off by default.

The good news is that this will stop the writing of inconsistent data to your database.
The bad news is that it will probably cause your system many more exceptions when going to persist.

For those of you who are using transactional messaging with automatic retrying, this will end up as “just” a performance problem (unless you follow the recommendations below). For those of you who are using regular web/WCF services (over TCP/HTTP), your “cross-cutting” exception management will likely end up discarding all the data submitted in those requests (but since that’s what you’re doing when you run into deadlocks, this shouldn’t be news to you).

The solution to the performance issues

Eventual consistency.

Funny isn’t it – all those people who were afraid of eventual consistency got inconsistency instead.

Also, it’s not enough to just have eventual consistency (like between the command and query sides of CQRS). You need to drastically decrease the size of your entities. And the best way of doing that is to partition those entities across multiple business services (also known in DDD lingo as Bounded Contexts) each with its own database.

This is yet another reason why I say that CQRS shouldn’t be the top level architectural breakdown. Very useful within a given business service, yes – though sometimes as small as just some sagas.

Next steps

It may seem unusual that the title of this post implies that SOA is the solution, yet the content clearly states that traditional HTTP-based web services are a problem. Even REST wouldn’t change matters as it doesn’t influence how transactions are managed against a database.

The SOA solution I’m talking about here is the one I’ve spent the last several years blogging about. It’s a different style of SOA which has services stretch up to contain parts of the UI as well as down to contain parts of the database, resulting in a composite UI and multiple databases. This is a drastically different approach than much of the literature on the topic – especially Thomas Erl’s books.

Unfortunately there isn’t a book out there with all of this in it (that I’ve found), and I’m afraid that with my schedule (and family) writing a book is pretty much out of the question. Let’s face it – I’m barely finding time to blog.

The one thing I’m trying to do more of is provide training on these topics. I’ve just finished a course in London, am doing another this week in Aarhus, Denmark, and another next month in San Francisco (which is now sold out). The next openings this year will be in Stockholm and London; Sydney, Australia and Austin, Texas will be coming in January of next year. I’ll be coming over to the US more next year so if you missed San Francisco, keep an eye out.

I wish there was more I could do, but I’m only one guy.

Hmm, maybe it’s time to change that.

NServiceBus 2.5 Released

udidahan — Fri, 31 Dec 2010 16:12:44 +0000

Just before we usher in the new year, I’m happy to announce the release of NServiceBus version 2.5.

Yes, there’s a new logo, and the website’s been redesigned.
It’s been a long time coming – the previous version (2.0) was released in March.

I’m really quite excited about this version as it rolls up all the bug fixes and enhancements that customers have asked for as they ran version 2.0 under the most severe types of production environments. The next thing that is a big deal that many have been asking for is a licensed version of NServiceBus – that is, the ability to purchase a commercial license and receive support.

We all know how managers like having a throat to choke.

And now they’ll have one – NServiceBus Ltd is the company that will be providing licensing, services, and support for all customers’ NServiceBus needs. After more than 33,000 downloads and over 1000 developers in the community, the demand has really grown. Who would’ve thought all this would happen when I started NServiceBus 4 years ago (before it even had a name).

Why NServiceBus is better than WCF for your distributed systems

This question comes up repeatedly for people hearing about NServiceBus for the first time.

The answer is simple – reliability.

A system built with NServiceBus is so much more reliable under all kinds of production conditions than one built with WCF that it’s hardly a fair comparison at all. While WCF can be configured to provide something kind of close to the same level of reliability, you need to do a fair amount of spelunking through the various options of netMsmqBinding to get it right.

The second reason to use NServiceBus instead of WCF is publish/subscribe.

The ability to make use of events and the observer pattern not just to achieve loose coupling within a single process, but across many processes, machines, and sites. Can you imagine going back to programming without events? Shudder. But that’s exactly what it’s like to use WCF in your distributed system. NServiceBus brings you the best of object-oriented programming but in a distributed and reliable infrastructure.

Don’t wait any longer

Take NServiceBus for a spin.

But things may look a bit different after you do…

http://www.NServiceBus.com

And have a happy New Year.

High Availability Presentation

udidahan — Mon, 21 Jun 2010 06:36:34 +0000

OK – this is the last one, I promise. Well, for now, anyway.

Earlier this month at TechEd North America I gave a fairly new presentation that was only delivered once before (at the Connected Systems User Group in London) and I’m happy to say is now online for your viewing pleasure.

High Availability – A Contrarian View

Comments? Thoughts? Let me know.

Lost Notifications? No Problem.

udidahan — Sun, 07 Dec 2008 09:46:05 +0000

One of the most common questions I get on the topic of pub/sub messaging is what happens if a notification is lost. Interestingly enough, there are some who almost entirely write off this pattern because of this issue, preferring the control of request/response-exception. So, what should be done about lost messages? The short answer is durable messaging. The long answer is design.

Durable Messaging

In order to prevent a message from being lost when it is sent from a publisher to a subscriber, the message is written to disk on the publisher side, and then forwarded to the subscriber, where it is also written to disk. This store-and-forward mechanism enables our systems to gracefully recover from either side being temporarily unavailable.

In my MSDN article on this topic, I outlined some problems with this approach. These problems are exacerbated for publishers. Imagine a publisher with 40 subscribers, publishing 10 messages a second, each containing 1MB of XML. If 10 of the subscribers are unavailable, that’s 100MB of data being written to the publisher’s disk every second, 6GB every minute. That’s liable to bring down a publisher before an administrator brews a cup of coffee.
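The arithmetic is simple enough to write down. The subscriber, message-rate, and size figures come from the example above; the 100GB of free disk is an assumption added purely to show how quickly the disk fills.

```python
def backlog_mb_per_second(unavailable_subscribers, messages_per_second, message_mb):
    """Data the publisher must buffer for subscribers that are down."""
    return unavailable_subscribers * messages_per_second * message_mb


def seconds_until_full(free_disk_mb, rate_mb_per_second):
    return free_disk_mb / rate_mb_per_second


rate = backlog_mb_per_second(unavailable_subscribers=10, messages_per_second=10, message_mb=1)
per_minute_gb = rate * 60 / 1000  # using 1 GB = 1000 MB, as in the text

# Hypothetical: the publisher has 100 GB of free disk.
minutes_left = seconds_until_full(free_disk_mb=100_000, rate_mb_per_second=rate) / 60
```

That works out to 100MB a second, 6GB a minute, and a bit under 17 minutes until the assumed 100GB is gone.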

Publishers have no choice but to throw away messages after a certain period of time.

Publisher Contracts

The whole issue of contracts and schema is considered one of the better understood parts of SOA. Unfortunately, the operational aspects of service contracts are hardly ever taken into account.

On top of the schema of the messages a service publishes, additional information is needed in the contract:

How big will this message be?
How often will it be published?
How long will this message be stored if a subscriber is unavailable?

The first two pieces of information are important for subscribers to do load and capacity planning. The last one is the most important as it dictates the required availability and fault-tolerance characteristics of subscribers.
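One way to picture those three questions is as extra fields that travel with the schema. The `PublisherContract` class and its field names below are hypothetical, as are the numbers; no particular framework defines contracts this way.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class PublisherContract:
    message_type: str
    max_message_bytes: int           # how big will this message be?
    max_messages_per_second: float   # how often will it be published?
    retention: timedelta             # how long is it kept for an absent subscriber?

    def peak_bytes_per_second(self):
        """What a subscriber has to be able to absorb at peak."""
        return self.max_message_bytes * self.max_messages_per_second

    def tolerates_outage(self, outage):
        """True if a subscriber can be down this long without losing messages."""
        return outage <= self.retention


contract = PublisherContract(
    message_type="OrderAccepted",
    max_message_bytes=1_000_000,
    max_messages_per_second=10,
    retention=timedelta(minutes=30),
)
```

A subscriber that expects to be offline for a nightly batch window can check `contract.tolerates_outage(timedelta(hours=12))` at design time and see that this particular pipe isn't for it.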

For Example

In the canonical retail scenario, when our sales service accepts an order, it publishes an order accepted event. Other services subscribed to this event include shipping, billing, and business intelligence.

While shipping and billing are highly available and able to keep up with the rate at which orders are accepted, the business intelligence service is not. BI has two main parts to it – a nightly batch that does the number crunching, and a UI for reporting off of the results of that number crunching. Some even do the reporting in a semi-offline fashion, emailing reports back to the user when they’re ready.

Furthermore, nobody’s going to invest in servers for making BI highly available.

And wasn’t the whole point of this publish/subscribe messaging to keep our services autonomous? That not all services have to have the same level of uptime?

Houston, do we have a problem?

Data Freshness

There is a glimmer of light in all this doom and gloom.

Not all services have the same data freshness requirements.

The business intelligence service above doesn’t need to know about orders the second they’re accepted. A daily roll-up would be fine, and an hourly roll-up would bring us that much closer to “real time business intelligence”.

So, while BI is ready to accept the sales message schema, it would like a slightly different contract around it – fewer messages per unit of time, more data in each message.

From the operational perspective of the sales service, it would be cost effective to have fewer “online” subscribers. It could even take things a few steps further. Instead of using the regular messaging backbone for transmitting these hourly messages, it could use FTP. The data could even be zipped to take up even less space. Since the total data size is less than the corresponding online stream, is stored on cheaper, larger storage, and the number of subscribers for this zipped, hourly update is fairly small, these messages can be kept around far longer.

If you’ve heard about consumer-driven contracts, this is it.

Note that we’re still talking about the same logical message schema.
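As a sketch of what the second pipe might look like, the snippet below packages the same events into hourly, compressed batches. The event fields (`order_id`, `accepted_at`) and the JSON-plus-gzip packaging are assumptions for illustration; how the resulting files get shipped (FTP or otherwise) is left out.

```python
import gzip
import json
from collections import defaultdict
from datetime import datetime


def hourly_batches(events):
    """Group events by the hour they occurred in and gzip each group.

    Each event keeps the same schema it has on the online stream; only the
    packaging changes. Returns {hour_label: compressed_bytes}.
    """
    by_hour = defaultdict(list)
    for event in events:
        hour = datetime.fromisoformat(event["accepted_at"]).strftime("%Y-%m-%dT%H")
        by_hour[hour].append(event)
    return {
        hour: gzip.compress(json.dumps(batch).encode("utf-8"))
        for hour, batch in by_hour.items()
    }


def read_batch(compressed):
    """What the BI side would do with a batch once it has fetched it."""
    return json.loads(gzip.decompress(compressed).decode("utf-8"))


events = [
    {"order_id": 1, "accepted_at": "2008-12-07T09:05:00"},
    {"order_id": 2, "accepted_at": "2008-12-07T09:40:00"},
    {"order_id": 3, "accepted_at": "2008-12-07T10:01:00"},
]
batches = hourly_batches(events)
```

The three events become two batches, one per hour, and each event inside a batch looks exactly like it did on the online stream.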

Summary

It’s not that lost notifications aren’t a problem.

It’s that they feed the design process, so that the resulting service ecosystem is set up in a way that notifications won’t get lost. I know that sounds kind of recursive, but that’s how it works. Either subscribers take care of their SLA, allowing them to process the online stream of events, or they should subscribe to a different pipe (which will have different SLA requirements, but maybe they can deal with those).

It makes sense to have multiple pipes for the same logical schema.

It’s practically a necessity to make pub/sub a feasible solution.

Reliability, Availability, and Scalability

udidahan — Sat, 15 Nov 2008 21:20:20 +0000

The great people at IASA have made the recording for my webcast available online:

The slides can be found here.

I also gave this talk at TechEd Barcelona and wanted to thank the attendee who posted this comment:

“You’ve done it again. Everytime I attend a session of yours I leave the room with new insights and inspiration on how to improve my software…”

You made my day.

Durable Messaging Dilemmas

udidahan — Thu, 17 Jul 2008 22:18:47 +0000

I’ve received some great feedback on my MSDN article and some really great questions that I think more people are wondering about, so I think I’ll try to do a post per question and see how that goes.

Libor asks:

“Would you recommend using durable messaging for systems where there are similar requirements with respect to data reliability as you had – ie. not losing any messages? If so, then why didn’t the final version of your solution use it? If not, can you explain why?”

The answer is, as always, it depends, but here’s what it depends on:

When designing a system, we need to take a good, hard look at how we manage state, and what properties that state has. In a system of reasonable size we can expect various families of state with respect to their business value, data volatility, and fault-tolerance window. Each family needs to be treated differently. While durable messaging may be suitable for one, it may be overkill or underkill for another.

So, here’s what we’re going to be looking at:

Business Value
Data Volatility
Fault-Tolerance Window

Business Value

When talking about business value, I want to talk about what “not losing any messages” means. The question is under what conditions the messages will not be lost, or rather, what the threshold conditions are where messages may start getting lost. If all our datacenters are nuked, we will lose data. It’s likely the business is OK with that (as much as can be expected under those circumstances). If a single server goes down, it’s likely the business would not be OK with losing messages containing financial data. However, if a message requesting the health of a server were to get lost under those same conditions, that would probably be alright. In other words, what does that message represent in business terms?

Data Volatility

Data volatility also has an impact. Let’s say that we’re building a financial trading system. The time between receiving an event (message) that the price of a certain financial instrument has changed and sending the message requesting to buy that security is critical. Let’s say that has to be done in under 10ms. Now, some failure has occurred preventing our message from reaching its destination for 20ms. What should we do with that message? Should we keep it around, making sure it doesn’t get lost? Not in this domain. On the contrary, that message should be thrown away as its “business lifetime” has been exceeded. Furthermore, even during that original period of 10ms, the use of durable messaging may make it close to impossible to maintain our response times.
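The "business lifetime" idea can be sketched as a check made before delivery. The `Message` class, its fields, and the ticker names are invented for the example; the 10ms lifetime and 20ms delay are the figures from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    sent_at_ms: float
    lifetime_ms: float

    def expired(self, now_ms):
        return now_ms - self.sent_at_ms > self.lifetime_ms


def deliverable(messages, now_ms):
    """Drop messages whose business lifetime has passed instead of delivering them late."""
    return [m for m in messages if not m.expired(now_ms)]


pending = [
    Message("BUY 100 ACME", sent_at_ms=0, lifetime_ms=10),      # stuck for 20ms
    Message("BUY 50 INITECH", sent_at_ms=15, lifetime_ms=10),   # only 5ms old
]
still_valid = deliverable(pending, now_ms=20)
```

The first order is discarded rather than delivered late; only the second is still worth sending.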

Fault-Tolerance Window

These two topics feed into the third and more architectural one – fault-tolerance window: for what period of time do we require fault tolerance, and with respect to how many (and what kind of) faults? This will lead us into an analysis of how many machines we need to copy a message to before we release the calling thread. We’d also look at which datacenters those machines reside in. This will also impact (or be impacted by) the kinds of links we have to these datacenters if we want to maintain response times. These numbers will need to change when the system identifies a disaster – degrading itself to a lower level of fault-tolerance after a hurricane knocks out a datacenter, and returning to normal once it comes back up.

Re-Evaluating Durable Messaging

Durable messaging may be used at various points in each part of the solution, but we need to look at message size, the rate those messages are being written to disk, how fast the disk is, how much available disk we have (so we don’t make things worse in the case of degraded service), etc. Companies like Amazon also take into account disk failure rates, replacement rates (disks aren’t replaced immediately you know), and many other factors when making these decisions.

Summary

Our job as architects when designing the system is to find that cost-benefit balance for the various parts of the system according to these very applicative parameters. No, it’s not easy. No, cloud computing will not magically solve all of this for us. But, we are getting more technical tools to work with, operations staff is getting better at working with us in the design phase, and our thought processes are getting more rigorous in dealing with the scary conditions of the real world.

To your question, Libor, as to why we didn’t eventually use durable messaging in our solution, the answer is that we solved the overall state management problem by setting up an applicative protocol with our partners which was resilient in the face of faults by using idempotent messages that could be resent as many times as necessary. You can read more about it here. This solution isn’t viable for other kinds of interactions but was just what we needed to get the job done.
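I haven't described the actual protocol here, so the following is only a generic sketch of what "idempotent messages that could be resent" can mean: the receiver remembers which message ids it has already applied, so a resend changes nothing. The `Account` class and the ids are made up.

```python
class Account:
    """Applies each deposit at most once, keyed by the sender's message id."""

    def __init__(self):
        self.balance = 0
        self._seen_ids = set()

    def handle_deposit(self, message_id, amount):
        if message_id in self._seen_ids:
            return False  # duplicate: already applied, safe to ignore
        self._seen_ids.add(message_id)
        self.balance += amount
        return True


account = Account()
# The sender doesn't know whether "m-1" arrived, so it resends it.
results = [
    account.handle_deposit("m-1", 100),
    account.handle_deposit("m-1", 100),
    account.handle_deposit("m-2", 50),
]
```

The balance ends up at 150, not 250. Because resending is harmless, the sender can simply keep retrying until it gets an acknowledgement, without either side needing a durable queue in between.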

Hope that helps.

Make WCF and WF as Scalable and Robust as NServiceBus

udidahan — Mon, 30 Jun 2008 14:47:08 +0000

This topic is getting more play as more people are using WCF and WF in real-world scenarios, so I thought I’d pull the things that I’ve been watching in this space together:

Reliability

Locking in SqlWorkflowPersistenceService (via Ron Jacobs) where, if you want predictable persistence (MS: ‘none of our customers asked for this to be easy’), you need to use a custom activity (which Ron was kind enough to supply).

“Given what I learned today I’d have to say that I’d be very careful about using workflows with an optimistic locking. Detecting these types of situations is not that simple.”

Let’s think about that. If we’re doing pessimistic locking, we run into the problem that if a host restarts (as the result of a critical Windows patch or some other unexpected occurrence), the workflow can’t be handled by any other host in the meantime (you didn’t care so much about your SLA, did you?).

Luckily, someone’s come up with a hack that works around this robustness problem in Scalable Workflow Persistence and Ownership.

“So this code will attempt to load workflow instances with expired locks every second. Is it a hack? Yes. But without one of two things in the SqlWorkflowPersistenceService its the sort of code you have to write to pick up unlocked workflow instances robustly.”

This will seriously churn the table used to store your workflows, decreasing performance of workflows that haven’t timed out. Oh well.

Testability

Implementing WCF Services without Referencing WCF (via Mark Seemann):

“More than a year ago, I wrote my first post on unit testing WCF services. One of my points back then was that you have to be careful that the service implementation doesn’t use any of the services provided by the WCF runtime environment (if you want to keep the service testable). As soon as you invoke something like OperationContext.Current, your code is not going to work in a unit testing scenario, but only when hosted by WCF.”

After pointing out some of the more basic difficulties in testability a straightforward WCF implementation brings, Mark turns the heat up in his follow-up post, Modifying Behavior of WCF-Free Service Implementations:

“Perhaps you need to control the service’s ConcurrencyMode, or perhaps you need to set UseSynchronizationContext. These options are typically controlled by the ServiceBehaviorAttribute. You may also want to provide an IInstanceProvider via a custom attribute that implements IContractBehavior. However, you can’t set these attributes on the service implementation itself, since it mustn’t have a reference to System.ServiceModel.”

Wow – all the things required to make a WCF service scalable and thread-safe make it difficult to test. In the end, we’re beginning to see how many hoops we have to go through in order to get separation of concerns, but until we can take all this and get it out of our application code, it’s an untenable solution. I hope Mark will continue with this series, if only so I can take the framework that might grow out of it and use it as a generic WCF transport for NServiceBus.

Comparison

After the Neuron-NServiceBus comparison that Sam and I had, we talked some more. After going through some of the rationale and thinking, Sam even put nServiceBus into his WCF-Neuron comparison talk. Sam had this to say about nServiceBus:

“The bottom line is: I like what I see. Although it’s a framework, not an ESB product like Neuron, it’s a powerful framework that takes the right approach on SOA and enforces a paradigm of reliable one-way, *non-blocking* calls. That is the point of the talk tonight overall; we need to get away from the stack world of synchronous RPC calls to true asynchronous non-blocking message based SOA systems.”

The main concern I have with a WCF+WF based solution is that developers need to know a lot in order to make it testable, scalable, and robust. In nServiceBus, that’s baked into the design. It would be extremely difficult for a developer writing application logic to interfere with when persistence needs to happen, or the concurrency strategy of long-running workflows. The fact that message handlers in the service layer don’t need concurrency modes, instance providers, or any of that junk makes them testable by default.

Advanced Messaging with a dash of DDD

udidahan — Mon, 18 Feb 2008 11:07:48 +0000

Following my last post (From CRUD to Domain-Driven Fluency) a bunch of questions have started popping up. One that I received via email from a client up in Ireland particularly caught my eye, so here it is:

Hi Udi, I think I see the point about the domain-driven approach but I’m wondering what my messages will look like. If it’s this:

IAppointment InsertInterview(Guid recruiterId, Guid applicantId, Guid appointmentId); OR

IRecuiter UpdateRecuiter(IRecuiter recruiter); (passing in an operation flag attached to the IRecuiter object) OR

IRecuiter UpdateRecuiter(IRecuiter recruiter); (setting a state flag on the relevant object and have the business object check the flag and behave according the state change)

Hope I’m not way off

Sean

Well, Sean, first of all – messages don’t look like functions. They’re a lot more like structures – data transfer objects. In this case, you’d probably be looking at a ScheduleInterviewMessage that had the relevant fields. It would look something like this:

    using System;
    using NServiceBus;
    using System.Xml.Serialization;

    namespace Messages
    {
        [Serializable]
        [Recoverable]
        [TimeToBeReceived("0:01:00.000")]
        public class ScheduleInterviewMessage : IMessage
        {
            public Guid InterviewerId;
            public Guid CandidateId;
            public DateTime RequestedTime;

            [XmlAnyElement]
            public object extra;
        }
    }

Before we go on, I want to explain what we see. The “recoverable” attribute is the way we indicate to the infrastructure that these messages should not be lost in case a server fails, there are network problems, etc. In essence, it does durable, store-and-forward messaging. This will create an environment in which, in the case of network problems, these messages will be written to disk. That’s a good thing, since once connectivity comes back or the server boots back up, the messages will still be around and can be sent.

Now these messages are fairly small, so even at a relatively high load, we shouldn’t be chewing through too much of our expensive, small, high performance local disks. However, if these messages were bigger, we may fill up our disks before connectivity comes back, and we all know what happens to Windows boxes when there’s no room left on the file system.

In order to prevent our system from Denial-of-Servicing itself, we need to make those messages clean themselves up. That’s what the “TimeToBeReceived” attribute is for. It sets the amount of time after which a message that has not yet been received by the other side will be deleted. The message may even have made it to the other machine while the process handling those messages was down. You wouldn’t want to be filling the other side’s disk either, causing them to crash, would you? This protects both parties.

To figure out what value to set, take the smallest amount of durable storage you have available at your nodes, divide it by the size of the average message, and then again by the rate at which you need to process messages, and leave yourself at least 100% spare.

In other words, to build a robust system you not only will need to deal with lost messages, but you will be actively throwing messages away.
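The sizing rule above can be sketched in a few lines of C#. This is only an illustration: the class and method names (TimeToBeReceivedSizing, Estimate) are made up for this example and are not part of NServiceBus, and “at least 100% spare” is read here as “use no more than half of the time it would take to fill the disk”.

```csharp
using System;

// Hypothetical helper, not an NServiceBus API: estimates an upper bound for
// TimeToBeReceived from the sizing rule described in the text.
public static class TimeToBeReceivedSizing
{
    public static TimeSpan Estimate(
        long smallestDurableStorageBytes,
        long averageMessageBytes,
        double messagesPerSecond)
    {
        if (smallestDurableStorageBytes <= 0)
            throw new ArgumentOutOfRangeException("smallestDurableStorageBytes");
        if (averageMessageBytes <= 0)
            throw new ArgumentOutOfRangeException("averageMessageBytes");
        if (messagesPerSecond <= 0)
            throw new ArgumentOutOfRangeException("messagesPerSecond");

        // How many average-sized messages fit on the smallest node's disk.
        double messagesThatFit =
            (double)smallestDurableStorageBytes / averageMessageBytes;

        // How long it takes to fill that disk at the required message rate.
        double secondsUntilFull = messagesThatFit / messagesPerSecond;

        // Assumption: "100% spare" means using only half of that window.
        return TimeSpan.FromSeconds(secondsUntilFull / 2.0);
    }
}
```

For example, with 10,000,000,000 bytes of storage, 10,000-byte messages, and 100 messages per second, the disk fills in 10,000 seconds, so Estimate returns a TimeSpan of 5,000 seconds (01:23:20). Anything longer than that risks filling the disk before messages expire.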

Finally, that last “XmlAnyElement” attribute is there for versioning. As we version our system and schema, we’ll be adding fields to the message. However, an old client may be talking to a new server, or vice versa. We wouldn’t want data to get lost just because of serialization, so elements that a given version doesn’t recognize are captured in that field instead of being dropped. In a future post, I’ll show how to set up a message handler pipeline exactly for these issues.
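To make the versioning point concrete, here is a minimal sketch using the plain .NET XmlSerializer, outside of NServiceBus. Everything named here is an assumption for illustration: OldScheduleInterviewMessage stands in for an older version of the contract, the Location element stands in for a field added by a newer version, and the catch-all field is typed as XmlElement[] rather than the object used in the message above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical older version of the contract: it knows nothing about <Location>.
[XmlRoot("ScheduleInterviewMessage")]
public class OldScheduleInterviewMessage
{
    public Guid InterviewerId;
    public Guid CandidateId;
    public DateTime RequestedTime;

    // Elements this version does not recognize are collected here.
    [XmlAnyElement]
    public XmlElement[] Extra;
}

public static class VersioningSketch
{
    // What a newer sender might put on the wire (Location is the added field).
    public const string NewerXml =
        "<ScheduleInterviewMessage>" +
        "<InterviewerId>00000000-0000-0000-0000-000000000001</InterviewerId>" +
        "<CandidateId>00000000-0000-0000-0000-000000000002</CandidateId>" +
        "<RequestedTime>2008-02-18T11:00:00</RequestedTime>" +
        "<Location>Room 4</Location>" +
        "</ScheduleInterviewMessage>";

    public static OldScheduleInterviewMessage Read(string xml)
    {
        var serializer = new XmlSerializer(typeof(OldScheduleInterviewMessage));
        using (var reader = new StringReader(xml))
        {
            return (OldScheduleInterviewMessage)serializer.Deserialize(reader);
        }
    }

    public static string Write(OldScheduleInterviewMessage message)
    {
        var serializer = new XmlSerializer(typeof(OldScheduleInterviewMessage));
        using (var writer = new StringWriter())
        {
            // The captured elements in Extra are written back out with the rest.
            serializer.Serialize(writer, message);
            return writer.ToString();
        }
    }
}
```

Calling VersioningSketch.Read(VersioningSketch.NewerXml) gives the old code its three known fields plus one captured Location element in Extra, so passing the message on with Write does not silently drop the newer sender’s data.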

Now that we’ve covered all the intricacies around messaging, we can see how the code that handles that above message looks:

    using System;
    using Messages;
    using NServiceBus;
    using NHibernate;

    namespace Server
    {
        public class ScheduleInterviewMessageHandler :
                     BaseMessageHandler<ScheduleInterviewMessage>
        {
            public override void Handle(ScheduleInterviewMessage message)
            {
                // SessionFactory is assumed to be an NHibernate ISessionFactory configured elsewhere.
                using (ISession session = SessionFactory.OpenSession())
                using (ITransaction tx = session.BeginTransaction())
                {
                    // Load both domain objects by the ids carried in the message.
                    ICandidateInterviewer interviewer = session.Get<ICandidateInterviewer>(
                            message.InterviewerId);
                    ICandidate candidate = session.Get<ICandidate>(
                            message.CandidateId);

                    interviewer.ScheduleInterviewWith(candidate)
                            .At(message.RequestedTime);

                    tx.Commit();
                }

                // publish new appointment data
            }
        }
    }

If you’ve read this far and have more questions, please feel free to send them my way. If you’re at a more time-critical part of your project and need an answer quickly, we can set up a Skype call. This has been working quite well for many of my overseas clients (shout out to the guys in Ireland and Florida).

Until next time
