Architecture – Udi Dahan – The Software Simplist

Microservices presentation [London 2014]

udidahan — Tue, 21 Jul 2015 11:23:45 +0000

… in which I realize I shouldn’t put off blogging about the presentations I’ve given.

This one is from µCon 2014: The Microservices Conference at Skills Matter in London.

The title of this talk was: An Integrated Services Approach

and the description:

After many years of the largely enterprise-scale SOA philosophy being applied across multiple systems, we’re now seeing some of that philosophy being applied to the design of the systems themselves with Microservices. Unfortunately, unless we integrate these enterprise and system level philosophies appropriately, we’ll end up with a mess of data duplication and coupling that may even result in businesses running on inconsistent data. Join Udi for a discussion of a unified approach that leverages the best of both worlds.

Hope you find it interesting.

Service-Oriented Composition (with video)

udidahan — Wed, 30 Jul 2014 12:44:48 +0000

When telling people about my approach to SOA, in which a given service would have client/browser-side components running side-by-side in the same process and even in the same page as components from other services, I often get asked this question:

“Doesn’t all of this loosely-coupled composition come with a high cost, in terms of client to server chit-chat?”

So, I’ve finally buckled down and put together a slide to illustrate how the technocratic IT/Ops service I’ve talked about in the past can provide components to resolve these sorts of problems.

After putting the slide together, and realizing some animation would do it good, I went and made a short (5 min) video including some verbal explanation as to how it all works – just for clarity. Check it out or watch it here:

And here’s the image showing everything in one picture:

People, Politics, and the Single Responsibility Principle

udidahan — Mon, 26 May 2014 06:24:59 +0000

In one of Uncle Bob’s recent blog posts on the Single Responsibility Principle he uses the example of using people and organization boundaries as an indication of possible good software boundaries:

When you write a software module, you want to make sure that when changes are requested, those changes can only originate from a single person, or rather, a single tightly coupled group of people representing a single narrowly defined business function. You want to isolate your modules from the complexities of the organization as a whole, and design your systems such that each module is responsible (responds to) the needs of just that one business function.

This is something that often comes up when I teach people about service boundaries when it comes to SOA – organization boundaries are the most intuitive choice.

And, once upon a time, that intuition might indeed have held up.

Stepping back in time

In the age before computers, organizations had a very specific way of structuring themselves.

People who had to work closely together sat in close physical proximity to each other. Data that was required on an ongoing basis would be in file cabinets also physically co-located with the people using that data, and it would be structured in a way that was optimal for their specific purposes. All of this was due to the high cost of communicating with people farther away.

If you needed data from a different department, you had to requisition it by filling out a special form, put it in your outbox, and then some guy from the mail room would pick it up, and physically schlep it to the right department, putting it in their inbox, and then someone there would get your data for you – putting it together with your original request, and then the mail guy would schlep it back. This inbox/outbox style of communication should ring a bell from the messaging patterns I talk about with NServiceBus.

As a result, different departments had to have very clearly delineated responsibilities with minimal overlap with each other. The organization just couldn’t function any other way.

And then a bunch of us geeks came along.

Enter the age of computers and networks

By introducing this technology, the cost of communication across large distances started falling – slowly at first, and then quite dramatically.

When anyone in an organization was able to access data from anywhere in the blink of an eye, an interesting dynamic started to unfold. All of a sudden, the division of responsibility between departments wasn’t as critical as it was before. When an employee needed to do something, there wasn’t this “that isn’t our job, you need to go to so-and-so” reaction. Because things could be done instantly, that’s exactly what happened.

And then came the politics

By removing the cost of communication, it became possible for more power-hungry people in the organization to start making (or trying to make) decisions that they couldn’t have made before. The introduction of computers into an organization was heralded as a new way of doing business – that the old organizational boundaries were a relic that we should leave behind us.

And thus came the re-org (the first of many).

Responsibilities and people were shuffled around, managers vied for more power, and politics took its place as one of the driving forces in the company structure.

Nowadays, if you want a decision made in a company, there isn’t just one person who has the authority to sign off on it anymore. No, you need to have meetings – and more meetings, with people you never knew existed in the company, or why on earth they should have a say on how something is supposed to get done. But that is now our reality: endlessly partially overlapping responsibilities across the organization.

So, what of the Single Responsibility Principle

This just makes it that much harder to decide how to structure our software – there is no map with nice clean borders. We need to be able to see past the organizational dysfunction around us, possibly looking for how the company might have worked 100 years ago if everything was done by paper. This might be possible in domains that have been around that long (like banking, shipping, etc.), but even there, given the networked world we now live in, things that used to be done entirely within a single company are now spread across many different entities taking part in transnational value networks.

In short – it’s freakin’ hard.

But it’s still important.

Just don’t buy too deeply into the idea that by getting the responsibilities of your software right, you will somehow reduce the impact that all of that business dysfunction has on you as a software developer. Part of the maturation process for a company is cleaning up its business processes in parallel to cleaning up its software processes.

The good news is that you’ll always have a job.

On that Microservices thing

udidahan — Mon, 31 Mar 2014 16:40:36 +0000

Seems that I’m a bit late to the Microservices party – original article here.

But since I’ve been getting repeated requests to weigh in on the topic, I guess I’ll have to risk fanning the flames up again.

Also, since quite a few reactions have already been written on the topic (and I don’t want to repeat them here), I’ll just point to this post by Arnon which sums them all up pretty well.

Now, I don’t entirely agree with all the commentary Arnon pointed to, or all of his thoughts on the topic, but I’ll try to take those up some other time.

And before jumping into it, let me say that there is a lot of good stuff in the article and that, regardless of naming, spreading the word more broadly on these approaches has value.

So, where do I stand on the topic

First of all, for those of you who have been following my blog for a while I’d say this:

Microservices almost equals Autonomous Components.

Why “almost”?

Because an Autonomous Component (AC) isn’t necessarily a physical unit of deployment – very often we’ll see multiple ACs deployed in the same physical process. One of the most common occurrences is in a web front end built as a composite UI. In the same web server process we’ll see components from multiple Services.

This is something that was hardly mentioned in the original article.

On Services and Systems

In my world, Services are a larger organizing principle that is meant to align solution domain boundaries with problem domain boundaries.

Now, that might sound very similar to this passage from the original article:

“The microservice approach to division is different, splitting up into services organized around business capability. Such services take a broad-stack implementation of software for that business area, including user-interface, persistant storage, and any external collaborations. Consequently the teams are cross-functional, including the full range of skills required for the development: user-experience, database, and project management.”

Now, this isn’t entirely surprising because I did have several conversations with both James and Martin on the topic over the past couple of years.

Still, there is something missing here that I believe is essential for achieving loose coupling, and that is that Services necessarily have to span system boundaries.

Let me repeat that: a Service will need to have components that are deployed to more than one system.

Here’s why:

Let’s say you have a piece of data like the price of a product. Not only will that data be visible in one system, often it will need to be shown (as well as updated) in other systems too. In order to have appropriate encapsulation of that concept, the owning service will need to be the one that owns the components that operate on that concept in the other systems.

This means that if we need to show the product price on an invoice in a back-end system, then that invoice would have to be a composite UI as well, and the service which owns the price will have a component deployed there which would be responsible for showing the price on the invoice.

In this manner, no code outside the service boundary would know about the concept of the product price and thus could not end up coupled to it.
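To make the composition idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a description of any real implementation: the `PricingComponent`, `CatalogComponent`, and `render_invoice` names, the product IDs, and the prices are all hypothetical. The point it tries to show is that the invoice only passes product IDs around, while each service's component fills in its own cell.

```python
from typing import Callable, Dict, List


class PricingComponent:
    """Hypothetical component owned by the service that owns prices."""

    def __init__(self) -> None:
        # Price data stays private to the owning service.
        self._prices: Dict[str, float] = {"sku-1": 9.99, "sku-2": 25.00}

    def render(self, product_id: str) -> str:
        return f"{self._prices[product_id]:.2f}"


class CatalogComponent:
    """Hypothetical component owned by the service that owns product names."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {"sku-1": "Widget", "sku-2": "Gadget"}

    def render(self, product_id: str) -> str:
        return self._names[product_id]


def render_invoice(product_ids: List[str],
                   columns: List[Callable[[str], str]]) -> List[str]:
    """The composite: it knows product IDs, never names or prices."""
    return [" | ".join(col(pid) for col in columns) for pid in product_ids]


lines = render_invoice(
    ["sku-1", "sku-2"],
    [CatalogComponent().render, PricingComponent().render],
)
```

Nothing in `render_invoice` would need to change if the price gained a currency or a discount rule; only the pricing component would.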

Although the original article does get into this to some degree (when talking about Decentralized Data Management), I don’t really see how a microservice/AC could end up having this level of ownership of data.

Still, the point made about different persistence technologies is valid at the level of services (though not ACs).

How big is a service

Now, if the price is not shared outside the boundary of the service, then how would order totals be calculated?

The answer is that the totals must (MUST) be calculated in the same service.

This shouldn’t be surprising as it’s just good old OO – encapsulating data with the logic that operates on it. Or, if you’d like, call it the Single Responsibility Principle: there should only be a single service impacted by a change to the definition of this data.
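As a rough illustration of keeping the calculation next to the data, the sketch below has a hypothetical `PricingService` that is handed only product IDs and quantities and works out the total itself. The class name, method name, and prices are assumptions made for the example.

```python
from decimal import Decimal
from typing import Dict


class PricingService:
    """Hypothetical service owning prices and everything derived from them."""

    def __init__(self, prices: Dict[str, Decimal]) -> None:
        self._prices = prices  # individual prices are never handed out

    def order_total(self, quantities: Dict[str, int]) -> Decimal:
        # Callers supply product IDs and quantities only.
        return sum(
            (self._prices[pid] * qty for pid, qty in quantities.items()),
            Decimal("0"),
        )


pricing = PricingService({"sku-1": Decimal("9.99"), "sku-2": Decimal("25.00")})
total = pricing.order_total({"sku-1": 2, "sku-2": 1})
```

A change to how a price is defined (say, adding tax) would then touch `PricingService` and nothing else.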

As a result, you’ll tend to see services that aren’t all that small, and probably not so many of them. In my experience, I’ve seen between 7 and 15 services the majority of the time.

Cross-service collaboration

Although I am glad to see the recommendation for event-based interactions between microservices, the focus on cross-process communication ignores some extremely important collaboration scenarios – the most important of which is in the client tier.

In a web application, it is quite common to have components written in javascript from multiple services interacting among themselves in the browser – one publishing events, others subscribing to those JS events. It is also quite common to see those JS components request some data from back-end components in the same service in response to those JS events.

This type of synchronous RPC communication within a service boundary is perfectly acceptable, although it stands in contrast to the recommendations of the microservices approach.
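The shape of that client-side collaboration can be sketched as follows. In a browser this would be JavaScript; the sketch uses Python only to show the structure, and every name in it (`ClientEventBus`, `ProductSelected`, `pricing_backend`, `pricing_widget`) is hypothetical. One component publishes an event, and a component from another service reacts by calling its own service's back end synchronously.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class ClientEventBus:
    """Stand-in for an in-page event mechanism (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: Dict[str, str]) -> None:
        for handler in self._handlers[event]:
            handler(payload)


def pricing_backend(product_id: str) -> str:
    """Stand-in for a synchronous request to the pricing service's back end."""
    return {"sku-1": "9.99"}[product_id]


shown: List[str] = []


def pricing_widget(payload: Dict[str, str]) -> None:
    # The pricing component reacts to another service's event, but only
    # ever asks its own back end for data.
    shown.append(pricing_backend(payload["product_id"]))


bus = ClientEventBus()
bus.subscribe("ProductSelected", pricing_widget)
bus.publish("ProductSelected", {"product_id": "sku-1"})  # e.g. from a catalog component
```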

Caveat on sharing data

I’ve been going on and on about the importance of not sharing data, but there is one exception to that rule.

There is a special service that I call IT/Ops which (among other things) is responsible for integration with 3rd party systems. As a part of this integration, it encapsulates data transformation logic and, as a result, needs to be able to receive data from the other more business-centric services.

As you can imagine, this puts IT/Ops in the risky position of coupling itself to a lot of things and thus needs to be done carefully. As a result, I recommend that many of the most skilled technical people work within the IT/Ops team, also serving in a consultative capacity to the other service teams.

In closing

I am extremely thankful to Martin and James for writing the Microservices article.

I think that the conversations it has sparked are timely, and hopefully more people will ponder these questions of how to structure their code-bases in order to avoid them becoming monolithic.

And while I think that it’s great to consider aligning team boundaries with service boundaries, people need to understand that it will need to be an evolutionary process – it will take time to transition an existing code base and an existing team structure to this new model, especially since these teams will have to continue to deliver features and bug fixes through the transition period. Jumping to the new model directly may cause more harm than good.

This is actually one of the most salient topics of my course (next one in NYC in May) – how do you get there from here. In my experience there are 4 phases that companies go through, often taking at least a couple of years, with larger environments taking potentially longer.

In any case, let’s keep the conversation going.

What are your thoughts? Have you been applying the Microservices approach, or possibly the one I talk about (I really should give it a name)? What’s been working well for you? What hasn’t?

Leave me a comment or write your own blog post.

Thoughts on a career in software development

udidahan — Fri, 27 Dec 2013 13:55:22 +0000

For much of the history of computers, programmers really only had one path to take – upward into management.

While you could go from Junior Programmer to Senior Programmer, sooner or later you were faced with the choice of becoming a Team Lead or having your career stagnate.

The primary difficulty with becoming a team lead is that the skills that made you an excellent Senior Programmer didn’t really carry over to leading a team.

On leading teams

Much ink has been spilled (and keyboards been pounded) on this topic, so I’ll just give the common solution that is proposed to this issue – having a parallel technical career track to the traditional management track.

After being a senior programmer, developers can grow into architects and upwards. IBM, for example, has the title “Fellow” reserved for this ultimate level.

An IBM Fellow is an appointed position at IBM made by IBM’s CEO. Typically only 4 to 9 IBM Fellows are appointed each year in May or June. It is the highest honor a scientist, engineer, or programmer at IBM can achieve. —Wikipedia

All that’s well and good, but I have a feeling that something is still missing.

Why does it have to be either/or?

What if we allowed, nay – encouraged, developers to try both types of roles as they advance in their career?

After your first year as a senior programmer, you are then assigned to be a mentor to a junior programmer. You don’t assign them work, but take responsibility for some of their professional development. During this time, you also start learning what it takes to be a good team lead and developing your soft skills – yourself being mentored by a more senior team lead (probably not a good idea for it to be the team lead on the project you’re working on). From there, you take on a team lead role on a small-ish project leading 2-3 other developers.

During your time as a team lead, architects in the company work with you to deepen your technical knowledge of larger system concerns – grooming you for your next role: an architect. Your experience as a team lead gives you a newfound appreciation for managing technical risk.

Later, as an architect with developed soft-skills, you are now much more capable of getting teams to adopt your ideas and to want to do “the right thing”, rather than just deliver the project any way they know how. Even as you develop your expertise in various technological areas, the organization has an eye on bringing you back to being a team lead, this time on a larger project.

In praise of the zigzag

I think this has actually been happening more than just a little in our industry, although I believe it usually happens as people move from one company to another.

Is it possible that the limitations of the structures in their previous companies contributed to their choice to move to another company? Well, I wouldn’t discount it.

I don’t think companies should pigeonhole developers – either you’re an X or a Y.

Human beings thrive on variety and occasionally stretching out of their comfort zone.

I believe this model optimizes for people’s personal growth and can be tuned quite easily.

I think this is also very much in alignment with the Software Craftsmanship movement and can make it easier for companies to develop and hold on to the talent they’ve been lucky enough to hire in the first place.

In closing

While I believe this model is workable for a lot of the software industry, both for consulting companies and internal development organizations, it’s clearly not going to be applicable for small startups. That being said, if the startup is successful and starts growing, it might not be such a bad idea to lay the groundwork for the team early on.

I don’t claim that this is the “best model” out there, and I haven’t tried it myself (yet), but I do believe that it has potential and would love to (re)spark the conversation about processes and structure that has seemed to die down under the Agile maxim of “Individuals and interactions over processes and tools” (though I believe that Agile as a whole *has* gotten us pointed in a better direction than where the industry was before).

What are your thoughts?

Write a comment or, even better, write something on your blog and spread the word!

Ask Udi 1: Alternative Architectures & Preaching to the Unconverted

udidahan — Fri, 28 Jun 2013 10:50:08 +0000

As promised, the podcast is back.

Download episode 1 here and then Subscribe to the feed.

There were 16 questions submitted and a couple hundred votes for the various topics. I was able to cover the top two questions.

Do you have a question you want to ask?
Want to vote on which questions will be answered next week?
Click here

This week’s questions

Rob Eisenberg asked:

It seems that every project I walk into has the exact same architecture, regardless of what the company is building. It’s that standard 3-tier pattern: data-business-presentation. But, there are other large-scale architectural patterns available. I’d love to hear some case studies that pair business problems with the rationale for choosing an “alternative architecture.”

And since it’s not just about knowing the right approach but also being able to convince others, I included Rvonwink’s question too:

Some of us see the genuine benefits of pub/sub, EDA and SOA design. However, how do you go about persuading the cynics, time pressed and uninformed:

Our developers hate debugging pub/sub models; Others love the ‘simplicity’ of monolithic domains; Our DBA questions why messaging is required (since “the bus simply persists messages elsewhere”); Our sys admins hate deploying new applications or changing the deployment topology; Our boss is scared to tell the business there is a little extra work to start splitting apart services.

Next week

Currently the top questions for next week are:

Composite UI, Business Components and Deployment
How to handle predetermined technology choices
How do you manage NULL pointer exception in general?

What would you like to hear? Let me know.

Until next week…

Queries, Patterns, and Search – food for thought

udidahan — Sun, 28 Apr 2013 10:44:56 +0000

With all the talk of CQRS, the area that doesn’t get enough treatment (in my opinion) is that of queries. Many are already beginning to understand the importance of task-based UIs and how that aligns to the underlying commands being sent, validated, and processed in the system as well as the benefits of messaging-centric infrastructure (like NServiceBus) for handling those commands reliably. When it comes to queries, though, it isn’t nearly as well understood what it means for a query to be “task based”.

Starting with CRUD

Let’s start with a traditional CRUD application and work our way out from there.

In these environments, we often see users asking us to build “excel-like” screens that allow them to view a set of data as well as sort, filter, and group that data along various axes. While we might not get this requirement right away, after some time users begin to ask us to allow them to “save” a certain “query” that they have set up, providing it some kind of name.

That, right there, is a task-based query and it is the beginning of deeper domain insight.
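One way to picture a saved query is as a named object with a fixed filter and sort order, rather than an ad-hoc grid setting. The sketch below is only an illustration: the `SavedQuery` class, the order fields, and the "Overdue open orders" name are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class SavedQuery:
    """A filter and sort order that a user has named (hypothetical shape)."""
    name: str
    predicate: Callable[[Row], bool]
    sort_key: str

    def run(self, rows: List[Row]) -> List[Row]:
        matches = [row for row in rows if self.predicate(row)]
        return sorted(matches, key=lambda row: row[self.sort_key])


orders: List[Row] = [
    {"id": 1, "status": "open", "days_overdue": 12},
    {"id": 2, "status": "paid", "days_overdue": 0},
    {"id": 3, "status": "open", "days_overdue": 3},
]

# The name the user picks is the first hint about the underlying task.
overdue = SavedQuery(
    name="Overdue open orders",
    predicate=lambda r: r["status"] == "open" and r["days_overdue"] > 0,
    sort_key="days_overdue",
)
result_ids = [row["id"] for row in overdue.run(orders)]
```

Once a query has a name like this, it becomes something you can ask the business about directly.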

Pattern matching

Any time a user is repeatedly running the same query (this can be once a day or some other unit of time) there is some scenario that the business is trying to identify and is using that user as a pattern-matching engine to see if the data indicates that that scenario has occurred.

It’s quite common for us to get a requirement to add some field (often a boolean or enum) to an entity which defaults to some value and then see that same field used in filtering other queries. These measures are sometimes instituted as a temporary stop-gap while a larger feature is being implemented, though (as the saying goes) there is nothing more permanent than a temporary solution.

Where we developers go wrong

The thing is, many developers don’t notice these sorts of things happening because we don’t actually look at the kinds of queries users are running.

One excellent technique to better understand a domain is to sit down with your users while they’re working and ask them, “what made you run that query just now?”, “why that specific set of filters?”.

What I’ve noticed over the years is that our users find very creative ways to achieve their business objectives despite the limitations of the system that they’re working with. We developers ultimately see these as requirements, but they are better interpreted as workarounds.

I’ll talk some more about how a software development organization should deal with these workarounds in a future post, but I want to focus back in on the queries for now.

Oh, and don’t get me started on caching or NoSQL, not that I think that those tools don’t provide value – they do, but they’re only relevant once you know which business problem you’re solving and why.

Not all queries are created equal

Even before bringing up the questions I described in the previous section, any time you get query-centric requirements the first question to ask is “how often will the user be running this specific query?”.

If the answer is that the specific query will be run periodically (every day, week, etc), then drill deeper to see what pattern the user will be looking for in the data. If the person you’re talking to doesn’t know how to answer that question, then go find someone who does. Every periodic query I’ve seen has some pattern behind it – and in my conversations with thousands of other developers over the years, I’ve seen that this is not just my personal experience.

But there is a case where a query does get run repeatedly without there being a pattern behind it.

I know this sounds like I’m contradicting myself, but the distinction is the word “specific” that I emphasized above.

There are certain users who behave very differently from other users – these users are often doing what I call research, i.e. the “I don’t know what I’m looking for but I’ll know it when I see it” people.

These researchers tend to repeatedly query the data in the system; however, they tend to run different queries all the time. This is the reason why traditional data warehouse type solutions don’t tend to work well for them. Data warehouses are optimized for running specific queries repeatedly.

Keeping the Single-Responsibility Principle in mind – we should not try to create a single query mechanism that will address these two very different and independently evolving needs.

And now on to Search

Search is a feature that is needed in many systems and whose complexity is greatly underestimated.

While the developer community has taken some decent strides in understanding that search needs to be treated differently from other queries, the common Lucene/Solr solutions that are applied are often overwhelmed by the size of the data set on which the business operates.

The problem is compounded by our user population being spoiled by Google – that simple little text box and voila, exactly what you’re looking for magically appears instantaneously. They don’t understand (or care) how much engineering effort went into making that “just work”.

Lucene and Solr work well when your data set isn’t too large, and then they become pretty useless as the quality of their results degrades. The thing is that many of us in IT tend to work on projects where we have an unrealistically small data set that we use to test the system and, at these volumes, it looks like our solutions work great. But if you have 20 million customers, do you think a full text search on “Smith” is going to find just the right one?

Larger data sets require a relevance engine – something that feeds off of what users do AFTER the query to influence the results of future queries. Did the user page to the next screen? That needs to be fed back in. Did they click on one of the results? That needs to be fed back in too. Did they go back to the search and do another similar search right after looking at a result – that should possibly undo the previous feedback.

And that’s just relevance for beginners.
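To give a feel for the feedback loop described above, here is a toy sketch. Everything in it is an assumption made for illustration: the `RelevanceFeedback` class, the weights, and the choice to treat paging past a result as a mild negative signal. A real relevance engine is far more involved.

```python
from typing import Dict, List, Optional, Tuple


class RelevanceFeedback:
    """Toy feedback store; the weights are illustrative assumptions."""
    CLICK = 1.0
    PAGED_PAST = -0.25

    def __init__(self) -> None:
        self._boost: Dict[Tuple[str, str], float] = {}
        self._last_click: Dict[str, str] = {}

    def _adjust(self, query: str, doc_id: str, delta: float) -> None:
        key = (query, doc_id)
        self._boost[key] = self._boost.get(key, 0.0) + delta

    def clicked(self, query: str, doc_id: str) -> None:
        self._adjust(query, doc_id, self.CLICK)
        self._last_click[query] = doc_id

    def paged_past(self, query: str, doc_ids: List[str]) -> None:
        for doc_id in doc_ids:
            self._adjust(query, doc_id, self.PAGED_PAST)

    def searched_again(self, query: str) -> None:
        # A similar search right after a click suggests the click did not
        # help, so the earlier boost is undone.
        doc_id: Optional[str] = self._last_click.pop(query, None)
        if doc_id is not None:
            self._adjust(query, doc_id, -self.CLICK)

    def rank(self, query: str, candidates: List[Tuple[str, float]]) -> List[str]:
        def score(candidate: Tuple[str, float]) -> float:
            doc_id, base = candidate
            return base + self._boost.get((query, doc_id), 0.0)
        return [doc_id for doc_id, _ in sorted(candidates, key=score, reverse=True)]


feedback = RelevanceFeedback()
candidates = [("smith-a", 1.0), ("smith-b", 1.0), ("smith-c", 0.9)]

feedback.clicked("smith", "smith-c")
ranked_after_click = feedback.rank("smith", candidates)

feedback.searched_again("smith")
ranked_after_retry = feedback.rank("smith", candidates)
```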

You know what makes Google, you know, Google? It’s that they have this absolutely massive data set of what users do after the query that informs which results they return when. You probably don’t have that. That and search is/was their main business for many years – I’m betting that it’s not your main business.

You should discuss this with your stakeholders the next time they ask for search functionality in your system.

In closing

I know that the common CQRS talking points tell you to keep your queries simple, but that doesn’t mean that simple is easy.

It takes a fair bit of domain understanding to figure out what the queries in the system are supposed to be – what tasks users are trying to achieve through these queries. And even when you do reach this understanding, convincing various business stakeholders to change the design of the UI to reflect these insights is far from easy.

It often seems like the reasonable solution to give our users everything, to not limit them in any way, and then they’ll be able to do anything. What ends up happening is that our users end up drowning in a sea of data, unable to see the forest for the trees, ultimately resulting in the company not noticing important trends quickly enough (or at all) and therefore making poor business decisions.

Even if your company doesn’t believe itself to be in “Big Data” territory, I’d suggest talking with the people on the “front lines” just in case. Many of them will report feeling overwhelmed by the quantity of stuff (to use the correct scientific term) they need to deal with.

It’s not about Lucene, Solr, OData, SSRS, or any other technology.

It’s on you. Go get ’em.

Life without distributed transactions

udidahan — Mon, 31 Dec 2012 10:34:44 +0000

Occasionally I get questions about the issue of transactional messaging – why is it so important, why does NServiceBus default to this behavior, and if we didn’t use it, what bad things could happen. I’m talking specifically about the ability to enlist a queue in a distributed transaction here.

I think the reason for this interest is the rise in popularity of cloud platforms and queuing systems like RabbitMQ (which don’t support distributed transactions) and the difficulty of setting up distributed transactions even on-premises.

Of course, there’s also the regular scalability hand-wringing going on even though most people wouldn’t bump up against those limits anyway.

In this post, I’ll talk about the nature of the problem, explain the pitfalls in some of the common solutions, but I’ll put off the description of how to provide consistency without distributed transactions to a future post as this one is already going to be quite long.

I’ll start with the basic fault-tolerance issues and then explain how things spiral out from there.

Starting with the basics

OK, so we have a queuing system in place that dispatches messages to our business logic which does some transactional work against a database.

Let’s say that we completed the transaction against our database but before we could acknowledge to the queue that the message was processed successfully, our machine crashed. When our machine comes up again, the queue will once again dispatch us the same message. Unless we have some logic to detect that we’ve already processed it (called “idempotence” in the REST community), we will end up processing it again.

In short, the problem is duplicates.
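The failure can be reproduced with a toy model. The `ToyQueue` class and the account data below are hypothetical stand-ins, not any real queuing API; the only property that matters is that a message stays in the queue until it is acknowledged.

```python
from typing import Dict, List, Optional

Message = Dict[str, object]


class ToyQueue:
    """At-least-once queue: a message stays until it is acknowledged."""

    def __init__(self, messages: List[Message]) -> None:
        self._pending = list(messages)

    def peek(self) -> Optional[Message]:
        return self._pending[0] if self._pending else None

    def ack(self) -> None:
        self._pending.pop(0)


balance = {"acct-1": 0}  # stands in for committed database state


def handle(message: Message) -> None:
    # Non-idempotent business logic: each call adds to the balance.
    balance[message["account"]] += message["amount"]


queue = ToyQueue([{"id": "m1", "account": "acct-1", "amount": 100}])

# First attempt: the database work commits, then the machine crashes
# before queue.ack() can run.
handle(queue.peek())

# After the restart the queue dispatches the same message again.
handle(queue.peek())
queue.ack()
```

The account ends up credited twice for a single message.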

Attempted solutions to the duplicate problem

Most queuing systems don’t do anything about duplicates, actually giving it a proper architectural name: At-least-once message delivery, as opposed to the Once-and-only-once model that a queue that supports distributed transactions provides.

The solution often suggested is to have your logic check to see if it has already processed a message with that ID before – in essence storing the ID of each message processed for some period of time. Of course, there is some performance overhead with that, but it might be a small price to pay compared to dealing with it in the logic of every use case.
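A minimal sketch of that ID check might look like the following. It assumes the set of processed IDs is saved in the same database transaction as the business data, which the in-memory `processed_ids` set only stands in for; the function and field names are hypothetical.

```python
from typing import Dict, Set

Message = Dict[str, object]

processed_ids: Set[str] = set()  # assumed to live in the same database as the data
balance = {"acct-1": 0}


def handle_once(message: Message) -> bool:
    """Apply the message unless its ID was seen before; report whether it ran."""
    if message["id"] in processed_ids:
        return False  # duplicate delivery: skip the business logic
    balance[message["account"]] += message["amount"]
    processed_ids.add(message["id"])  # recorded together with the data change
    return True


deposit = {"id": "m1", "account": "acct-1", "amount": 100}
first = handle_once(deposit)
second = handle_once(deposit)  # redelivery after a crash
```

In practice the stored IDs would also need to expire after some period, as noted above.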

On the other hand, you’ll often have some messages (like Update commands) for which it looks like you can safely process them multiple times, in which case you might not want to pay the performance overhead there. The thing is, if your logic publishes an event in addition to the regular database work (something that is quite common) and you process the same message twice, you will probably end up publishing the event twice as well.

These duplicates are different in that here we have two distinct messages with different IDs that contain the same business data. This means that recipients of these messages will not be able to filter them out at an infrastructure level anymore.
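Here is a small sketch of how that plays out, assuming each published event gets a fresh ID at send time. The `customer` record, the `EmailChanged` event, and the handler name are made up for the example.

```python
import uuid
from typing import Dict, List

customer = {"email": "old@example.com"}
published: List[Dict[str, str]] = []


def handle_update(message: Dict[str, str]) -> None:
    # The database work is naturally idempotent: setting the same value
    # twice leaves the same state.
    customer["email"] = message["email"]
    # The publish is not: every call creates an event with a new ID.
    published.append({
        "id": str(uuid.uuid4()),
        "type": "EmailChanged",
        "email": message["email"],
    })


update = {"id": "m1", "email": "new@example.com"}
handle_update(update)
handle_update(update)  # the same message, delivered a second time
```

The database looks fine, but subscribers receive two events that an ID-based filter cannot tell apart from two genuine changes.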

NOTE: Deduplication abilities in queues

Although the Azure Service Bus doesn’t support distributed transactions meaning you still have the issue mentioned above, Microsoft added the ability to detect and filter out duplicates based on message contents rather than just the ID. This helps quite a bit but it’s important to understand that that doesn’t cover everything for you. Let me explain:

More complex logic

In some of your most important use cases, you may have both entity updates as well as entity creation happening together in your domain model. You might be using some kind of event model (like I wrote about here) to percolate out the information that an entity was created in order to keep your service layer decoupled from the internals of the domain model.

In the callback code from these domain events, you will likely publish out an event on the queuing system containing information like the ID of the entities created as well as other business data. And there’s the rub.

You see, without distributed transactions, you can run into some problematic scenarios:

For example, if you don’t make sure that your event publishing calls to the queuing system include the same transaction object as the one you used when retrieving the original message from the queue, then those calls could “escape” before you know if the database transaction is going to succeed. Deadlocks always happen at the lousiest times. Anyway, if you’re using database generated IDs for your entities, then those IDs will get published out in events despite the database rolling back and your subscribers will now be making decisions on wrong data – not just eventually consistent data.

In this case, processing the message again doesn’t really solve the problem – it just means that you’ll be publishing events with different IDs, so an infrastructure like Azure Service Bus couldn’t really de-duplicate them.

On the other hand, if you do use the same transaction and combine it with the infrastructure-level, message-ID-based de-duplication described above (as identifying duplicate calls for complex business logic is damn hard), you’ll run into another problem.

Consider what would happen if your server crashes right after finishing its database work but before it completes the transaction against the queuing system. When going to retry the message, the infrastructure filtering thing would know not to call your business logic again and that message would be quietly swallowed. Unfortunately, the event publishing calls to the queuing system from the first time the message was processed were rolled back and since your business logic isn’t called again, the event publishing won’t happen again.
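To make that sequence concrete, here is a small simulation of it. This is a sketch under simplifying assumptions, not real queue or database code: the "database" is a set of committed message IDs, the "queue transaction" is a list of pending events, and all names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LostEventSimulation
{
    // Returns the number of events visible to subscribers after the retry.
    public static int Run()
    {
        var committedIds = new HashSet<Guid>();   // de-duplication table, lives in the database
        var publishedEvents = new List<string>(); // what subscribers actually receive
        var messageId = Guid.NewGuid();

        // Attempt 1: the server crashes after the database commit
        // but before the queue transaction completes.
        Process(messageId, committedIds, publishedEvents, crashBeforeQueueCommit: true);

        // Attempt 2: the queue redelivers the same message.
        Process(messageId, committedIds, publishedEvents, crashBeforeQueueCommit: false);

        return publishedEvents.Count;
    }

    static void Process(Guid messageId, HashSet<Guid> committedIds,
                        List<string> publishedEvents, bool crashBeforeQueueCommit)
    {
        var pendingEvents = new List<string>(); // sends enlisted in the queue transaction

        if (!committedIds.Contains(messageId))
        {
            // Business logic runs: an entity is created and an event is staged.
            pendingEvents.Add("EntityCreated");
            committedIds.Add(messageId); // database commit (business data plus message ID)
        }

        if (crashBeforeQueueCommit)
            return; // queue transaction rolls back, staged events are discarded

        publishedEvents.AddRange(pendingEvents); // queue transaction commits
    }
}
```

Run() comes back with zero published events even though the database work was committed: the retry is filtered out as a duplicate, so nothing ever restages the event.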

Oops.

In closing

I hope I’ve been able to clarify what kind of scenarios distributed transactions solve for you and some of the difficulties in solving them yourself.

Now, to be clear, you could solve these problems by going in-depth on each of your use cases, analyzing the consistency needs and structuring the code differently to address those needs. But give this another thought: if our consistency depends on calling otherwise independent APIs in exactly the right order, and a change in that order would not cause any visible functional effects, what would happen when developers with less expertise maintain this code?

The folks in the event sourcing community have their solution to this which is based on writing their business logic differently. As the adoption of this pattern is still pretty limited (probably still in the Innovator section of the Technology Adoption Curve), it’ll be interesting to see how it holds up with larger teams in the mainstream.

Oh, and in case it wasn’t clear from before, the guys in the REST community haven’t even begun addressing this problem when it comes to server-to-server integration.

We’re working on a solution for this with NServiceBus that won’t require you to change how you write business logic. We’ve got one big release to do before we can roll this in, and that’s coming soon (with all sorts of cool things like support for ActiveMQ and queues in the database). The solution we’ve found is architecturally sound but you’ll have to wait for my next post to hear about it.

Stay tuned.

Service-Oriented API implementations

udidahan — Mon, 10 Dec 2012 15:29:27 +0000

It’s quite common for our systems to need to expose an API for external parties to call that isn’t exactly aligned with our service boundaries – at least, when you follow the “vertical services” model rather than the “layered services” approach. I’ve blogged many times about the problems with layering, so I won’t go into that now beyond saying that you really, REALLY, should avoid it.

A short intro to SOA, done right

In the “vertical services” approach I espouse, you often see components from multiple services deployed to any given endpoint. While these services usually don’t need to communicate with each other at all, occasionally you’ll see them leaving “breadcrumbs” behind for each other – things like UserId or OrderId in the session. What’s especially important is that these IDs are accessible even before the entity is finalized – this enables each service to collect its own data without needing any other service to know about that data.

In an ecommerce environment, we would see one service owning money, another the product catalog (excluding prices, those would be owned by the previous service), another service owning the customers’ payment info (credit cards, etc), and yet another owning shipping addresses – all of these separate from the one that owns the shopping cart. Let’s call these Finance, Catalog, Payment, Shipping, and finally Shopping – just so that we have something to reference later.

The API

While we can do all sorts of cool browser composition with the UI in our own system, enabling each service to collect and display its own information, if we want to expose an API for clients to call, we wouldn’t want to force those clients to have to make a separate call to each service in order to make a purchase. Instead, we’d want something that looks like:

MakePurchase(Guid orderId, Dictionary<Guid, int> cart, CreditCardInfo cc, Address shippingAddress)

In case you were wondering, the service which owns the definition of this API is different from all of the above services – it is a service that is primarily technical in nature and is responsible for things like integration and data transformations. I call this service IT/Ops.

Getting the data from the API to the services

We don’t want any of our business-centric services to know about anybody else’s data structures, which leaves it to IT/Ops to pass the data to them. The thing is that we still want to do that in the most loosely-coupled way possible – with messaging being a good candidate for that.

So, what we’ll do is have IT/Ops send a message containing the data to the other services, but with a slight twist.

Here’s what the code would look like with NServiceBus:

Bus.SendLocal(new Order { OrderId = orderId, Cart = cart,
                          CreditCard = cc, ShippingAddress = shippingAddress });

Why the SendLocal?

So that the components from the other services can all run together with IT/Ops in the same process giving a nice tight deployment model.

UPDATE:

If you’re using a transport like RabbitMQ that is set up as a remote broker, the overhead of going back through the broker might not be worth the improved reliability you’d get by going through messaging. In that case, you might want to consider the Domain Events approach. This would give you a similar level of decoupling but then IT/Ops would need to set up a surrounding transaction, well, that is if you want all the services’ processing to succeed/fail as one unit. If you don’t, you’d probably be better off just sending messages from IT/Ops to endpoints that host each of the components of the other services.

End Update

Before we get to the services, let me show you the Order class – specifically, the interfaces it implements:

public class Order : ShoppingOrder, PaymentsOrder, ShippingOrder
{
    public Guid OrderId { get; set; }
    // product ID -> quantity (the Guid key type is an assumption)
    public Dictionary<Guid, int> Cart { get; set; }
    public CreditCardInfo CreditCard { get; set; }
    public Address ShippingAddress { get; set; }
}

public interface ShoppingOrder
{
    Guid OrderId { get; set; }
    Dictionary<Guid, int> Cart { get; set; }
}

public interface PaymentsOrder
{
    Guid OrderId { get; set; }
    CreditCardInfo CreditCard { get; set; }
}

public interface ShippingOrder
{
    Guid OrderId { get; set; }
    Address ShippingAddress { get; set; }
}

Each of the above interfaces represents the data that each service cares about. Therefore, each service will provide an assembly that handles that message/data and persists it to its database, like this:


public class ShoppingAPIHandler : IHandleMessages<ShoppingOrder>
{
    public void Handle(ShoppingOrder message)
    {
        //persist to shopping service db
    }
}

public class PaymentsAPIHandler : IHandleMessages<PaymentsOrder>
{
    public void Handle(PaymentsOrder message)
    {
        //persist to payment service db
    }
}

public class ShippingAPIHandler : IHandleMessages<ShippingOrder>
{
    public void Handle(ShippingOrder message)
    {
        //persist to shipping service db
    }
}

Since the Order object being sent is a polymorphic match for all of these interfaces, NServiceBus knows to invoke all of these handlers. By the way, if you care about the order of invocation, then you can control that as well (but I won’t get into that here).

Also, since all of these handler assemblies are deployed to the same endpoint, and the Order object is sent just once, this means that all the handlers will be invoked in a single transaction on a single thread – either all of them succeeding, or all failing. Since they’re all connected on the same Order Id, referential integrity can be preserved as well.
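If the polymorphic matching feels like magic, the following self-contained sketch shows the general idea. It is not NServiceBus’s actual implementation; PolymorphicDispatcher, Register, Dispatch, and the two small interfaces are hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

public interface IShoppingData { Guid OrderId { get; } }
public interface IShippingData { Guid OrderId { get; } }

// One object that is a polymorphic match for both interfaces.
public class PurchaseData : IShoppingData, IShippingData
{
    public Guid OrderId { get; set; }
}

public class PolymorphicDispatcher
{
    private readonly List<KeyValuePair<Type, Action<object>>> handlers =
        new List<KeyValuePair<Type, Action<object>>>();

    // Registers a handler for messages assignable to T.
    public void Register<T>(Action<T> handler)
    {
        handlers.Add(new KeyValuePair<Type, Action<object>>(
            typeof(T), m => handler((T)m)));
    }

    // Invokes, on the calling thread and in registration order, every handler
    // whose message type the object is an instance of.
    // Returns the number of handlers invoked.
    public int Dispatch(object message)
    {
        int invoked = 0;
        foreach (var entry in handlers)
        {
            if (entry.Key.IsInstanceOfType(message))
            {
                entry.Value(message);
                invoked++;
            }
        }
        return invoked;
    }
}
```

Because the handlers run one after another on the same thread, an exception thrown by any of them propagates out of Dispatch, which is what lets a surrounding transaction fail the whole message as one unit.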

Wrapping up

When you are building a system on SOA principles, you’ll often find that you need a service like IT/Ops to handle data transformation and other broker-centric tasks. While much of SOA is based on the Bus Architectural Style – meaning primarily publish/subscribe interaction between services – that doesn’t mean that your business-centric services cannot have their components deployed in the same process.

I’d go so far as to say that if you aren’t deploying components from multiple services in process with each other at least some of the time, then it’s quite likely that your service boundaries are incorrect.

Anyway, I hope you found this post interesting. Shout out to Slawek who gave me the idea for this post.

By the way, if you’d like to learn more about these kinds of patterns, the next batch of courses is open for registration – but the early bird prices are almost over, so you’d better hurry.

JPA with REST, OData, and SQL

udidahan — Sun, 04 Nov 2012 14:56:24 +0000

Feeling a little bit rant-y today, as I just saw some more abuse of remote calls, this time on the Java side of things.

JPA is the Java Persistence API – a kind of ORM, as you’d expect. Luckily, a lot of the web services stuff was already on the way out by the time that EclipseLink DBWS came out. DBWS allowed you to expose database artifacts as web services.

I mean, it’s not like we have any other interoperable ways of accessing data, right?

Anyway, like I said, that didn’t take off, but now they’re reinventing it – this time with REST!

In case you had any doubts, REST is pure awesomeness and adding it to anything else makes it awesome too. Lest anybody take this out of context (it’s happened before), I’m being sarcastic.

Here it is.

God knows they couldn’t let Microsoft totally dominate this area with OData coming out quite some time ago. In case you were wondering, OData was designed to provide standard CRUD access of a data source over HTTP.

Of course, none of these support any transactions so if you actually wanted to do some meaningful business logic on top of this CRUD, you wouldn’t have any consistency. And, let’s face it, if you’re not doing any meaningful business logic, just basic persistence, you just do it. That problem was solved a long time ago.

Can we please stop reinventing SQL already?

Leveraging irrationality towards success

udidahan — Thu, 27 Sep 2012 09:10:20 +0000

We’ve all seen good ideas emerge in the software space – from objects, to components, to services, to domain models, and the *DD approaches. Yet, in most organizations, it is very hard for these ideas to get traction.

I’ve heard from countless developers and architects over the years about their frustration in getting everybody else to go along with them. “Can’t they see how much better [new approach] is over what we’re doing now?!” they ask, believing that things could and actually would be evaluated on their merits, especially in a rational field like IT.

The usual explanation I give has a couple of parts.

Conway’s Law

In 1968 Melvin Conway penned what later became known as Conway’s Law which stated:

“organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.”

An important corollary of that law is that if you wish to have a significant impact on the design of a system, you would need to have a similarly significant impact on the communication structure of the organization making that system.

The main problem is that the people that tend to be pushing for DDD, IoC, CQRS, SOA, etc are usually not as strong when it comes to the soft skills that are so necessary for bringing about organizational change. The thing is that, at a minimum, these types of changes take 3 to 5 years so it really takes a long-term commitment, both from the individual and the organization.

On the rationality of people in IT

First of all, people are a whole lot less rational than they’d like to believe – or that they’d like other people to notice. In fact, people will go to great lengths to maintain the appearance of consistency and rationality, even at the cost of harm to themselves. How’s that for irrational?

Don’t take my word for it – there’s a great book on the topic: Predictably Irrational: The Hidden Forces That Shape Our Decisions. The somewhat scary thing about it is that not only are we irrational beings, but that that irrationality can be predicted and, yes, even manipulated.

Once you can understand that the people you’re trying to convince aren’t Vulcan, you have a much better chance of being effective. I’d say that, for myself, understanding my own modes of irrationality increased my effectiveness as well, and made me quite a bit happier in life too.

Why you need to bring in a consultant

This isn’t me hawking my wares – believe me, I’m busy enough as it is, but let me know when this starts to sound familiar to you.

There’s a problem in your organization – could be that you’re not delivering software fast enough, high enough quality, whatever. Suffice it to say that Management isn’t happy. You’ve been living this pain for a while and know exactly what the source of the problem is (more often than not, management has at least a hand if not a whole arm in it). You come up with some recommendations, bring them to the higher-ups, but ultimately are ignored, dismissed, or don’t even get into the room.

Some time later, management brings in a Consultant (that’s right, with a capital ‘C’) who is there to figure out what’s wrong and come up with recommendations. In some cases, especially in larger organizations, they bring in a whole bunch of them from a brand name like McKinsey or Ernst & Young.

If these guys are smart, they listen to you, ultimately presenting your analysis and recommendations to management. Of course, those higher-ups are in awe of how quickly these guys were able to understand the inner workings of their organization. That awe lends instant credibility to their recommendations which are then adopted and given powerful political backing.

And you’re sitting there thinking, “but… but… but that’s what I was saying!!”.

It’s not the message – it’s the messenger.

Let me put it another way, explained from the perspective of management – we’re having problems, you work here, ergo you’re part of the problem. Also, you don’t make that much money (compared to management), so how smart could you be? Those brand-name consultants, well, they cost a LOT, so they MUST be good (good enough to know not to work here too).

Therefore the more the consultant costs, the more likely management is to listen, which ultimately creates the conditions for success, which makes the change happen, which proves to management that they were right to bring in an expensive consultant. A vicious (or virtuous) cycle – depending on how you look at it.

Now, it doesn’t always work this way, but it does often enough to perpetuate management’s world view.

In Closing

I do hope your organization and its leaders aren’t trapped in this kind of dysfunction, but if they are, know that you’re not alone and that you can get help – either via consulting or in some books:

Some good books include Made to Stick: Why Some Ideas Survive and Others Die and the grandfather of the field: How to Win Friends & Influence People. There are countless others and there isn’t any right place to start – the most important thing to do is to start.

It’s been over 40 years since Melvin Conway’s observation and, as an industry, we’re still relearning these things – usually through the school of hard knocks. But there is an upside here – I’m pretty sure that, knowing these patterns, you could pick up on some signals during the interviewing process and find a company that’s outgrown many of these issues – one that would be able to have more meritocratic discussions on technical choices.

In the worst case, you could become a consultant and make a living off of all this irrationality.

Build one to throw away

udidahan — Sat, 08 Sep 2012 21:38:10 +0000

There’s a pattern I’ve seen with companies starting on a large green-field project that gets them into trouble later on. I wanted to call it out because the pitfall is quite subtle.

It’s something I touched on in my presentation last week about Balancing Architecture & Agile at the Agile Practitioners group in Israel but didn’t have as much time to get into.

Agile

You see, at the beginning of these projects, often our users or customers don’t exactly know what they want. Even if you go through some process of doing mock-ups, sometimes users can’t really know what they want until they can interact with working software.

This is one of the advantages of an agile approach that focuses on getting working software in the hands of users very early. It allows us to mitigate the risk of us building the wrong system.

Pitfalls

The mistake here is believing that this needs to be done in a production-ready manner, using all of the super-scalability and highly reliable infrastructural elements in our target deployment environment.

There is value in building real functionality on top of your production infrastructure: you learn more about the challenges of doing that and, in the process, refine the API of your platform, bake more capabilities into it, etc. The problem is that you are slowing down the important learning of what system needs to be built.

Decouple those two things.

Build one to throw away.

Sometimes, the way to be most effective is to be somewhat inefficient.

Once you’ve gotten your users to the point that they can point at a piece of working software and say, “Yes – build me that”, then you can go about building it the right way on top of your production platform.

Let me go a step further, just to be crystal clear.

Do not unit-test that code. Do not use TDD, DDD, CQRS, Event Sourcing, SOA, etc.

Just get it done.

Architecture

You also want to have a good understanding of the use cases in your system before you make too many big architectural decisions (where that might even be just 1 decision, but you only find out after the fact). Some use cases are architecturally significant – many aren’t. This falls under the “just enough architecture up-front”, but is really very dependent on the experience and skills of the people on your team so I really can’t give you anything more prescriptive.

Of course, there’s another risk around this approach.

Organizational Dysfunction

In some organizations, there isn’t enough of an understanding or appreciation of the difference between production-ready software and a prototype. I think we’ve all been through this at one time or another. If users see “the software” works, we might be forced to deploy the prototype – all our protests for naught.

If this is the environment you’re in, don’t worry about it.

Seriously – don’t.

Growing up

You can explain to a child that fire is hot, and that they will get burned if they stick their hand in it, but all of that is theoretical – it doesn’t really sink in, because they haven’t yet had a visceral experience of “hot”.

If the organization you’re in isn’t that mature yet, they’re going to have to get burned a couple of times until they learn. The only thing you can do as a responsible adult is to *try* to get them started with smaller burns so that they’ll learn the lessons with the minimal amount of pain.

Given that the people doing the learning are higher up the food-chain, you really don’t have much influence. Accept that and move on. If necessary, cover your assets – write an email containing the “meeting minutes” in which you describe the concerns that you raised and that the decision was made against your recommendations – but phrase it gently. Depending on the state of the economy and your network, it might be a good time to start job-hunting.

In closing

I said it before but it bears repeating:

Efficiency is secondary to effectiveness.

You want to make sure you’re headed in the right direction before putting the pedal to the metal.

Rework is to be accepted and, yes, even embraced.

There’s even a book about it.

Have fun.

High Availability a la Dennis

udidahan — Mon, 03 Sep 2012 14:51:28 +0000

I thought I’d give a bit of a shout-out to Dennis van der Stelt, who’s been applying many of the principles from my course on his project and now has something of a success story to share. Although all he put out was a little tweet about it, I think if more people bug him, we can tease some more out:

First big update released of our system based on @udidahan ADSD training, makes a maintainable system & site updates in <1sec instead of 24h

— Dennis van der Stelt (@dvdstelt) September 1, 2012

It’s not just about getting the system built quickly the first time, and it’s not just about having a maintainable code base longer term. Equally important is the ability to quickly deploy major releases to the system without downtime – ‘cuz when your system is down, it’s not providing any business value (which is what being Agile is all about).

What do you say, Dennis? Tell us some more!

UPDATE

Dennis gives the full story here.

Data Duplication and Replication

udidahan — Tue, 28 Aug 2012 08:10:59 +0000

Occasionally I’ll get questions from people who have been going down the CQRS path about why I’m so against data duplication. Aren’t the performance benefits of a denormalized view model justified, they ask. This is even more pronounced in geographically distributed systems where the “round-trip” may involve going outside your datacenter over a relatively slow link to another site.

CQRS

As has been said several times before by many others, it’s not the denormalized view model that defines CQRS.

One of the things that sometimes surprises people after going through my course is that in most cases you don’t need a denormalized view model, or at least, not the kind you think. Yes, that’s right: MOST cases.

But I don’t want to get too deep into the CQRS thing in this post – that can wait.

SOA

The big thing I’m against is raw business data being duplicated between services.

Data that can be expected to be accessible in multiple services includes things like identifiers, status information, and date-times. These date-times are used to anchor the status changes in time so that our system will behave correctly even if data/messages are processed out of order. Not all status information necessarily needs to be anchored in time explicitly – sometimes this can be implicit to the context of a given flow through the system.

For example, the Amazon.com checkout workflow.

In that flow, if you provide a shipping address that is in the US, you are presented with one set of options for shipping speed, whereas an international address will lead you to a different set of options.

Assuming that the address information of the customer and the shipping speed options are in different services, we need to propagate the status InternationalAddress(true/false) between these services in that same flow. In this case, there isn’t a need to explicitly anchor that status in time.
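For the explicit case, where a status change is anchored in time, a minimal sketch might look like the following. StatusChange and StatusTracker are hypothetical names, and the rule used here (keep only the change with the latest timestamp per entity) is one simple assumption about what “behaving correctly” means when messages arrive out of order.

```csharp
using System;
using System.Collections.Generic;

public class StatusChange
{
    public Guid EntityId;
    public string Status;
    public DateTime OccurredAtUtc; // the date-time that anchors the status in time
}

public class StatusTracker
{
    private readonly Dictionary<Guid, StatusChange> latest =
        new Dictionary<Guid, StatusChange>();

    // Applies a change only if it is newer than the one already held.
    // Returns false when the incoming change is stale and was ignored.
    public bool Apply(StatusChange change)
    {
        StatusChange current;
        if (latest.TryGetValue(change.EntityId, out current) &&
            current.OccurredAtUtc >= change.OccurredAtUtc)
        {
            return false;
        }

        latest[change.EntityId] = change;
        return true;
    }

    // Returns null when no status is known for the entity.
    public string CurrentStatus(Guid entityId)
    {
        StatusChange current;
        return latest.TryGetValue(entityId, out current) ? current.Status : null;
    }
}
```

If a “Shipped” change stamped at 10:05 arrives before a “Paid” change stamped at 10:00, the tracker keeps “Shipped” and ignores the late arrival.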

But what’s so bad about duplication of data between services?

The danger is that functionality ultimately follows raw business data.

You start with something small like having product prices in the catalog service, the order service, and the invoice service. Then, when you get requirements around supporting multiple currencies, you now need to implement that logic in multiple places, or create a shared library that all the services depend on.

These dependencies creep up on you slowly, tying your shoelaces together, gradually slowing down the pace of development, undermining the stability of your codebase where changes to one part of the system break other parts. It’s a slow death by a thousand cuts, and as a result nobody is exactly sure what big decision we made that caused everything to go so bad.

That’s the thing, it wasn’t viewed as a “big decision” but rather as just one “pragmatic choice” for that specific case. The first one excuses the second, which paves the way for the third, and from that point on, it’s a “pattern” – how we do things around here; the proverbial slippery slope.

So what’s with the word “Replication” in the title of this post?

While data duplication between services is very dangerous, replication of business data WITHIN a service is perfectly alright.

Let’s get back into multi-site scenarios, like a retail chain that has a headquarters (HQ) and many stores. Prices are pushed out from the HQ and orders are pushed back from the stores according to some schedule.

We know that we can’t guarantee a perfect connection between all stores and the HQ at all times, therefore we copy the prices published from the HQ and store them locally in the store. Also, since we want to perform top-level analytics on the orders made at the various stores, that would be best done by having all of those orders copied locally at the HQ as well.

We should not view this movement of data from one physical location to another as duplication, but rather as replication done for performance reasons. If some magical always-on zero-latency network existed, we wouldn’t need to do any of this replication.

And that’s just the thing – logical boundaries should not be impacted by these types of physical infrastructure choices (generally speaking). Since services are aligned with logical boundaries, we should expect to see them cross physical boundaries – this includes SYSTEM boundaries (since a system is really nothing more than a unit of deployment).

I know that you might be reading that and thinking “What!?” but there isn’t enough time to get into this in any more depth here. You can read some of my previous posts on the topic of SOA for more info here.

Cross-site integration without replication

There are some domains where sensitive data cannot be allowed to “rest” just anywhere. Let’s look at a healthcare environment where we’re integrating data from multiple hospitals and care providers. While all of these partners are interested in working together to make sure that patients get the best care, which means that they need to share their data with each other, they don’t want any of THEIR data to remain at any partner sites afterwards (and are quite adamant about this).

In these cases, the decision was made that performance is less important than data ownership. Personally, I don’t agree with this mindset. The fact that data is “at rest” in a location as opposed to “in flight” does not change ownership. It could be stored in an encrypted manner so that only a certain application could use it, resulting in the same overall effect, but this is an argument that I’ve never won.

People (as physical beings) put a great deal of emphasis on the physical locations of things. It’s understandable but quite counterproductive when dealing with the more abstract domain of software.

In closing

Because we don’t duplicate raw business data between services, the regular data structures inside a service already look very different from what they would have looked like in a traditional layered architecture with an ORM-persisted entity model.

In fact, you probably wouldn’t see very many relationships between entities at all.

Going beyond that, you probably wouldn’t see the same entities you had before. An Order wouldn’t exist the way you expect; addresses (billing and shipping) would be stored (indexed by OrderId) in one service whereas the shipping speed (also indexed by OrderId) would be in another, and the prices may well be in yet another.

It is in this manner that data does not end up being duplicated between services, but rather is composed by many services whether that is in the UI of one system, the print-outs done by a second system, or in the integration with 3rd parties done by a third system.
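A rough sketch of that kind of composition follows. The three service classes and OrderSummaryComposer are hypothetical stand-ins: each “service” holds only its own data, indexed by OrderId, and the composer (think of a print-out) pulls from each of them without any service storing another’s data.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Each hypothetical service keeps only its own data, indexed by OrderId.
public class AddressService
{
    public readonly Dictionary<Guid, string> ShippingAddressByOrder = new Dictionary<Guid, string>();
}

public class ShippingOptionsService
{
    public readonly Dictionary<Guid, string> SpeedByOrder = new Dictionary<Guid, string>();
}

public class PricingService
{
    public readonly Dictionary<Guid, decimal> TotalByOrder = new Dictionary<Guid, decimal>();
}

public static class OrderSummaryComposer
{
    // Builds one line of output from data owned by three different services.
    // There is no Order entity anywhere; the OrderId is the only shared piece.
    public static string Compose(Guid orderId, AddressService addresses,
                                 ShippingOptionsService shipping, PricingService pricing)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "Ship to {0} via {1}, total {2:0.00}",
            addresses.ShippingAddressByOrder[orderId],
            shipping.SpeedByOrder[orderId],
            pricing.TotalByOrder[orderId]);
    }
}
```

In a real system each lookup would be a component provided by the owning service rather than a dictionary, but the shape is the same: correlation happens on the identifier, not on copied business data.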

If performance needs to be improved, look at having these services replicate their data from one physical system to another – in-memory caching is one way of doing this, denormalized view models might be thought of as another (until you realize there isn’t very much normalization within a service to begin with).

And a word from our sponsor

For those of you on “rewrite that big-ball-of-mud” projects looking to use these principles, I strongly suggest coming on one of my courses. The next one is in San Francisco and I’ve just opened up the registration for Miami.

For those of you on the other side of the Atlantic, the next courses will be in Stockholm in October and in London this December.

The schedule for next year is also coming together and it will include South Africa and Australia too.

Anyway, here’s what one attendee had to say after taking the course earlier this month:

I wanted to thank you for the excellent workshop in Toronto last week. I spent the better part of the weekend reflecting over what was presented, the insights we learned through the group exercises, and how my preconceptions of SOA have changed. By the end of the course, all the tidbits of (usually) rather ambiguous information that I’ve collected from various blogs, books, and other sources, finally coalesced into something more intelligible – one big A-HA moment if you will. Overall, I found the content of the workshop to be incredibly enlightening and it left me feeling invigorated and excited to learn more.
– Joel from Canada

Hope you’ll be able to make it.

If travel is out of the question for you, you can also look at getting a recording of the course here.

One final thing

If your employer won’t foot the bill for these, please get in touch with me.
I wouldn’t want you not to be able to come just because you’re paying out of pocket.

There are very substantial discounts available.

Contact me.

Bandwidth, Priority, and Service Contracts

udidahan — Mon, 20 Aug 2012 15:55:30 +0000

Here’s a small quick tip that can help you improve the performance of important use cases in your systems. It doesn’t require very many changes to your code and can improve matters when your system is under load but won’t make much of a difference when you have capacity to spare.

This is something I talk about on the first day of my course when going through the fallacies of distributed computing – specifically fallacy #3 which talks about bandwidth.

What about bandwidth?

When it comes to network bandwidth in your datacenter, there’s a pretty good chance you’re still on gigabit ethernet. When most developers hear that prefix, “giga”, there is an instantaneous translation in their brain to “so much I don’t need to worry about it”.

The thing is that it’s GigaBIT, not GigaByte, so we’re talking about 125 MBps.

Also, keep in mind that hardly anybody programs at the level of ethernet, we’re several layers up the stack. You can expect roughly 40% of the bandwidth of ethernet up in TCP land due to its collision detection, exponential backoff, etc. So that’s roughly 50MBps, not counting overhead for serialization (which can be very significant if it’s text-based like XML or JSON).

In practice, you might be getting something like 25MBps – definitely not so much that you don’t even need to think about it.
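The arithmetic is easy to redo for your own network. Here is a rough sketch, taking a gigabit as 10^9 bits per second, which works out to 125 MBps before any overhead. The 40% TCP factor is the estimate from above; the 50% serialization factor is purely an assumption, picked to land near the 25MBps ballpark.

```python
# Back-of-the-envelope estimate; every factor here is a rough guess, not a measurement.
LINK_BITS_PER_SECOND = 1_000_000_000   # gigabit ethernet
BITS_PER_BYTE = 8
TCP_EFFICIENCY = 0.40                  # "roughly 40%" from the text
SERIALIZATION_EFFICIENCY = 0.50        # assumed cost of text formats such as XML or JSON

raw_mbytes_per_s = LINK_BITS_PER_SECOND / BITS_PER_BYTE / 1_000_000
tcp_mbytes_per_s = raw_mbytes_per_s * TCP_EFFICIENCY
usable_mbytes_per_s = tcp_mbytes_per_s * SERIALIZATION_EFFICIENCY

print(f"raw: {raw_mbytes_per_s:.0f} MBps")
print(f"over TCP: {tcp_mbytes_per_s:.0f} MBps")
print(f"after serialization: {usable_mbytes_per_s:.0f} MBps")
```

Swap in your own link speed and overhead guesses; the point is only that the usable number is a small fraction of what "giga" suggests.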

Everybody’s talking scalability in terms of number of servers and memory, storage, CPU per server – but what about the network? More importantly, what happens when (not if) you run out? Well, the latency of your calls increases – and that can be quite substantial.

Business Priority

And now we get to the crux of the matter.

Consider a “Customer” web service with a bunch of methods on it, including these two: GenerateTopCustomersByRegionReport and MarkCustomerAsFraud.

Now, your system is under significant load and there’s just enough bandwidth left for one call to make it across the network without timing out. Two users invoke the functionality above – one doing the report, the other doing the fraud. Should the fact that one user clicked the button a millisecond before the other mean that the MarkCustomerAsFraud should be delayed to the point of failure?

I’m fairly sure that if we asked a business stakeholder the answer would be a clear no.

While we could try asking the network engineers to give higher priority to that webmethod, let’s face it, that’s never going to happen. Since both methods are on the same web service, clients are bound to the address where that web service is hosted, regardless of which methods they call.

The problem with “the” network

If there’s one word I loathe in the English language, it’s the word “the”.

Such a small word, tacked on in front of so many other words that, without you even noticing it, traps your mind into thinking you can have only one of that thing.

The network.
The database.

But you CAN have more than one – it’s YOUR system. Design it however you like.

Most servers (if not all) can have more than one network card these days. And even if yours couldn’t – you can ask your network engineers to set up multiple virtual networks on top of the physical network and divide up the bandwidth between them.

Putting it together

The next step is to simply put the MarkCustomerAsFraud method on a different web service. This way, you can decide at deployment time which web service should be hosted over which network.

When your system is under load, you will then be guaranteed that even if there are a large number of low priority calls being invoked, they will not use up the network bandwidth reserved for your higher priority calls. You will likely still need to take care of processing and other IO concerns on your servers, but if worst comes to worst, you can partition your server farm as well.
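To make the split concrete, here is a minimal sketch of the two decisions involved: which operation lives on which service contract, and which network each service is hosted on. The service names and addresses are hypothetical and only illustrate the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    service: str
    address: str  # hypothetical addresses on two separate (virtual) networks


# Deployment-time decision: which service is hosted over which network.
ENDPOINTS = {
    "CustomerReporting": Endpoint("CustomerReporting", "http://10.0.1.10/reporting"),
    "CustomerFraud": Endpoint("CustomerFraud", "http://10.0.2.10/fraud"),
}

# Design-time decision: which operation belongs to which service contract.
OPERATION_TO_SERVICE = {
    "GenerateTopCustomersByRegionReport": "CustomerReporting",
    "MarkCustomerAsFraud": "CustomerFraud",
}


def endpoint_for(operation: str) -> Endpoint:
    """Resolve the endpoint a client should call for a given operation."""
    return ENDPOINTS[OPERATION_TO_SERVICE[operation]]


print(endpoint_for("MarkCustomerAsFraud").address)
print(endpoint_for("GenerateTopCustomersByRegionReport").address)
```

Because the two operations no longer share an address, a flood of report requests cannot eat the bandwidth set aside for the fraud calls.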

While this may sound a bit CQRS-ish, it would be more accurate to say that CQRS is a more specific case of this pattern – that of partitioning the API according to business priority.

One of the interesting things about messaging is that we tend to forgo the traditional “service contracts” where many methods are put together on a single “service”. Instead, each message definition stands on its own and can be routed to any destination.
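As a rough illustration of that difference, the sketch below gives each message type its own route instead of hanging methods off a shared contract. The queue names are made up, and this is not the API of any particular messaging library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkCustomerAsFraud:
    customer_id: str


@dataclass(frozen=True)
class GenerateTopCustomersByRegionReport:
    region: str


# Each message definition stands on its own; the destinations are hypothetical.
ROUTES = {
    MarkCustomerAsFraud: "fraud@high-priority-net",
    GenerateTopCustomersByRegionReport: "reports@bulk-net",
}


def destination_for(message: object) -> str:
    """Look up the destination by the message's own type, not by a shared contract."""
    return ROUTES[type(message)]


print(destination_for(MarkCustomerAsFraud("c-1")))
print(destination_for(GenerateTopCustomersByRegionReport("EMEA")))
```

Re-routing one message type means changing one entry, without touching any other message.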

In summary

If you are still using WCF and web services, be aware that these apparently little things can have an impact on how your system behaves under load. Even if you do use MSMQ under WCF, the traditional service contract made of multiple methods will still govern your routing.

If you do go all the way with this pattern, you’ll see that each of your service contracts ends up with only one method on it. This might make you wonder what’s the point of the whole service contract thing in WCF – that would be a very good issue to resolve.

Remember the first rule of remote communication – Don’t.

Services are (still) not Remotely Callable Components

udidahan — Sun, 29 Jul 2012 18:21:05 +0000

We’re more than half-way through 2012, more than 16 years since the term SOA was introduced, and still people are treating the term “service” as if it were nothing more than a remotely-callable component (like a web service).

Take this “Bare-knuckle SOA” thing that’s started making the rounds. Not to say that the approach doesn’t have its merits (regardless of the SOA thing), but it seems to be talking about little more than regular XML and HTTP communication. Personally, I think that the functional testing done over HTTP in that post would have been much better done directly against the API of the component, bypassing HTTP entirely.

Appending the word “service” to a concept doesn’t change any of its architectural properties.

EmailService, CurrencyService, GeoLocationService – all of them just components which have methods that can be invoked remotely. If anything, the design would probably have been better if all the XML and HTTP was removed – if the components were all hosted in the same process.

In fact, in a well designed Service-Oriented Architecture, we tend to see components from many services deployed in process with each other (as I showed in my recent post UI Composition for Correct Service Boundaries).

But I’ve got to admit – the “bare knuckle” prefix, that’s some good link-bait.

UI Composition vs. Server-side Orchestration

udidahan — Mon, 09 Jul 2012 06:49:26 +0000

Following on from my last post called UI composition techniques for correct service boundaries, one commenter didn’t seem to like the approach I described, saying:

“I’m sorry, but with all due respect I must strongly disagree. You haven’t avoided any orchestration work at all, you’ve just moved it in to client side script!

How are you going to deal with the scenario that one of the service calls fails? Say a failed credit card payment, or no more rooms left? In more javascript??

I would much rather take the less brittle approach of introducing an orchestration service. Like it or not, however trivial it may be, there is a relationship between these services, if one call fails, they both fail. This should be reflected in the architecture, not hidden in javascript. With an orchestration service you also either get transactions for free provided by infrastructure, or alternatively if the underlying service doesnt support this, explicit and unit testable control over recovery.”

Since this is a common point of view, I thought I’d take the time to explain a bit more.

Let’s start at a fairly high level.

On failures

I’ve talked many times in the past about how to handle technical causes for failure like server crashes, database deadlocks, and even deserialization exceptions. Messaging and queuing solutions like NServiceBus can help overcome these issues such that things don’t actually fail – they just take a little longer to succeed.

On the logical side of things, the CQRS patterns I talk about describe an approach where aggressive client-side validation is done to prevent almost all logical causes for failure. The only thing that can’t be mitigated client-side are race conditions resulting from actions taken by other users at the same time.

In short, it really is uncommon for things to fail when being processed server-side.

Back to the specific example

The concerns raised in the comment specifically talked about a failed credit card payment or no rooms left in the hotel, so let’s start with the credit card thing:

In my last post I talked about collecting guest and credit card information from the user as a part of the “checkout” process when making a reservation for a hotel room. Just to be clear – there is a final “confirm your reservation” step that happens after all information has been collected.

What this means is that we aren’t actually charging the customer’s card when we collect that data, therefore there is no real issue with a failed credit card payment that needs to be handled by the client-side javascript. When the customer confirms their reservation, yes, there might be a failure when charging the card though there are only some specific types of rates for which the hotel charges your card when you make a reservation.

In general, failed credit card payments are handled pretty much the same way for all ecommerce – an email is sent to the customer asking for an alternative form of payment, also saying that their purchase won’t be processed until payment is made.

In any case, it is only after the reservation is placed that the responsible service would publish an event about that. The service which collected the credit card information would be subscribed to that event and initiate the charge of the card when that event arrives (or not, depending on the rate rules mentioned).
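Here is a minimal sketch of that flow, using a tiny in-process publish/subscribe as a stand-in for real messaging. The class names, the event's fields, and the boolean standing in for the "rate rules" are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReservationPlaced:
    reservation_id: str
    rate_requires_prepayment: bool  # stand-in for the rate rules


class EventBus:
    """Tiny in-process publish/subscribe, standing in for real messaging."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        for handler in self._handlers[type(event)]:
            handler(event)


class PaymentService:
    """Holds the card data it collected earlier, keyed by reservation ID."""

    def __init__(self):
        self.cards = {}
        self.charged = []

    def on_reservation_placed(self, event):
        # Charge only when the rate calls for it and we actually hold a card.
        if event.rate_requires_prepayment and event.reservation_id in self.cards:
            self.charged.append(event.reservation_id)  # a real service would call a gateway


bus = EventBus()
payments = PaymentService()
payments.cards["r-1"] = "card-token-1"
payments.cards["r-2"] = "card-token-2"
bus.subscribe(ReservationPlaced, payments.on_reservation_placed)

bus.publish(ReservationPlaced("r-1", rate_requires_prepayment=True))
bus.publish(ReservationPlaced("r-2", rate_requires_prepayment=False))
print(payments.charged)
```

The publisher knows nothing about cards or charging; the payment side decides for itself what to do when the event arrives.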

With regards to there not being any rooms left, well, first of all, there’s overbooking – hotels accept more reservations than rooms available because they know that customers sometimes need to cancel, and some just don’t show up. Secondly, there is a manual compensation process if more people show up than there are actual rooms to put them in. In some cases, a hotel will bump you up to a higher class of room (assuming there aren’t too many reservations for those), and in others they will call a “partner” hotel nearby and put you up there instead.

In summary

While arguments can be made that yes, these issues have been addressed in this specific example, there may be other domains where it is not possible to do these kinds of “tricks”. Although I do agree with that in theory, I’ve spent the better part of 5 years travelling around the world talking to hundreds of people in quite a few business domains, and every single time I’ve found it possible to apply these principles.

In short, the use of UI composition allows services to collect their own data, making it so anything outside that service doesn’t depend on those data structures which makes both development and versioning much easier. Technical failure conditions can be mitigated at infrastructure levels in most cases and other business logic concerns can be addressed asynchronously with respect to the data collection.

Give it a try.

UI Composition Techniques for Correct Service Boundaries

udidahan — Sat, 23 Jun 2012 13:04:00 +0000

One of the things which often throws people off when looking to identify their service boundaries is the UI design. Even those who know that the screen a user is looking at is the result of multiple services working together sometimes stumble when dealing with forms that users enter data into.

Let’s take for example a screen from the Marriott.com online reservation system (below). This screen collects information about the guest staying at the hotel (name, phone number, address, etc) and credit card information.

While we might have wanted to keep guest information in a separate service from the credit card information (which may very well be the corporate card of someone responsible for travel), the above screen would seem to indicate that the data would be collected together, validated together, and would also have to be processed together.

The traditional way

In standard layered architectures you would have all the data submitted by the user passed in a single call from a controller to some “service layer” (possibly running on a different machine), which would then persist that data in one transaction.

Even if some attempt was made to separate things out, there likely would be some “orchestration service” that received the full set of data and it would make calls to the other “services”, passing in the specific data that each “service” is responsible for.

I am putting quotes around the word “service” to indicate that I don’t consider these proper services in the SOA sense (as they lack the necessary autonomy) – they are more like functions or procedures; whether or not they’re invoked via XML over HTTP is beside the point.

What to do?

Like so many other things, the solution is simple but a bit counter-intuitive as it doesn’t follow the way most web development is done, i.e. one submit button => one call to the server.

Let’s say the “Red” service is responsible for guest information and the “Blue” service is responsible for credit card data. In this case, each service would have its own javascript come down with the page and that script would register itself for a callback on the click of the submit button. Each service would take the data the user entered into its part of the page and independently make a call to “the” server (could be to 2 separate servers) where the data is persisted (potentially to 2 different databases).
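In a real page this would be javascript in the browser; the sketch below only simulates the idea so the moving parts are visible. The Page class, the field names, and the two in-memory stores are all hypothetical stand-ins.

```python
class Page:
    """Stand-in for the browser page: form fields plus a submit event."""

    def __init__(self, fields):
        self.fields = fields
        self._on_submit = []

    def register_on_submit(self, callback):
        self._on_submit.append(callback)

    def click_submit(self):
        # One click, but every registered component makes its own call.
        for callback in self._on_submit:
            callback(self.fields)


# Stand-ins for two separate servers and databases.
red_store = []
blue_store = []


def red_guest_component(fields):
    """The Red service takes only the guest fields from its part of the page."""
    red_store.append({key: fields[key] for key in ("guest_name", "phone")})


def blue_card_component(fields):
    """The Blue service takes only the credit card fields."""
    blue_store.append({key: fields[key] for key in ("card_number", "card_expiry")})


page = Page({
    "guest_name": "Ann Example",
    "phone": "555-0100",
    "card_number": "4111111111111111",
    "card_expiry": "12/30",
})
page.register_on_submit(red_guest_component)
page.register_on_submit(blue_card_component)
page.click_submit()

print(red_store)
print(blue_store)
```

Neither component ever sees the other's data, which is the whole point of the split.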

This raises other questions, of course.

Now that the data submitted is being processed in 2 transactions rather than just one, we may need to figure out how to correlate the data. In this specific case, it’s not such a big deal as there is no direct relationship between the guest and the credit card – both need to be independently correlated to some reservation ID.

That reservation ID would likely have been “created” on a button click on a previous screen by some other service. The reason why I put the word “created” in quotes is that this could be as simple as having the client generate a new GUID and put that in a cookie (which would cause the reservation ID to end up being submitted along with subsequent requests). Another alternative would be to put the reservation ID in the session.
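A small sketch of the cookie variant, with plain dictionaries standing in for the cookie jar and for each service's storage. The function and field names are hypothetical.

```python
import uuid


def start_reservation(cookies):
    """'Create' a reservation ID on the client; nothing is persisted yet."""
    if "reservation_id" not in cookies:
        cookies["reservation_id"] = str(uuid.uuid4())
    return cookies["reservation_id"]


def submit(store, cookies, data):
    """Each service stores its own data under the ID that rides along in the cookie."""
    store[cookies["reservation_id"]] = data


cookies = {}
guest_store = {}
card_store = {}

reservation_id = start_reservation(cookies)
submit(guest_store, cookies, {"guest_name": "Ann Example"})
submit(card_store, cookies, {"card_token": "tok-123"})

print(reservation_id in guest_store and reservation_id in card_store)
```

The two submissions never reference each other; they only share the reservation ID.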

It’s quite possible that the reservation ID would only be persisted much later in the service that owns it when the user actually confirms the reservation on the website.

In any case, what we can see is that each of the commands of our respective services can now be processed independently of the others in an entirely asynchronous fashion thus vastly improving the autonomy of our services.

Some words on CQRS

This style of UI composition where services leverage javascript code running in the browser isn’t technically difficult in the slightest. The rest of the implementation of each service – having a controller that takes that data and passes it on for persistence – can be quite simple.

I’d say even more strongly, most of the time you shouldn’t need to use any fancy-dancy messaging to get that data persisted – that is, unless you’re still stuck with the big relational database behind 23 firewalls type data tier. Embrace NoSQL databases for the simplicity and scalability they provide – don’t try to re-invent that using messaging, CQRS, persistent view models, event-sourcing, and other crap.

There are other very valid business reasons to embrace CQRS, but they have nothing to do with persistence.

Also notice, this is all happening within a service boundary / bounded context.

In closing

If you aren’t leveraging these types of composite UI techniques, it’s quite likely that your service boundaries aren’t quite right. Do be aware of the UI design and use it to inform your choices around boundaries, but beware of certain programming “best practices” that may lead you astray with your architecture.

Also, if you’re planning on coming to my course in Toronto to learn more about these topics, just wanted to let you know that there’s one week left for the early-bird discount.

Finally, it’s good I have a birthday that comes around once a year to remind me that my time here isn’t unlimited and that I had better get off my rear and do something meaningful with the time I do have. If you get value from these posts, leave a comment or send me a tweet to let me know – it does wonders for my motivation.

Thanks a bunch.

Logically Distributed, Physically Centralized

udidahan — Sun, 06 May 2012 22:56:45 +0000

When people pull back the covers on something like MSMQ, particularly its private queues (the way NServiceBus uses it), and they see that MSMQ is storing its messages in C:\Windows\System32, well, they’re not particularly happy.

One of the reasons they worry about these types of distributed or federated queue-based solutions has to do with physical failures. The concern is that messages would be lost if there was a hard drive failure.

The preference for centralized message broker type solutions is that we can set it up on a nice RAID infrastructure that will take care of any physical reliability concerns. (Just so that we’re clear here – I’m talking about a single datacenter, possibly connected to a disaster recovery site.)

So, here’s the thing:

Virtualization

You see, in a virtualized production environment, the C drive of a virtual machine is physically in the image file of that VM, which is sitting on a SAN (storage area network).

What that means is that when a message is sent from one processing node to another, the data of that message ends up being written to the SAN, with all of its redundant disks. Even if one of the machines has a critical failure and cannot start up again, all the VMs that were running on it can be started up on a different machine without any message loss.

In fact, most virtualized environments have monitoring and management capabilities built-in so that the VMs will be brought up automatically on another machine. Even if you aren’t using messaging, there are so many other benefits that virtualization brings that you probably should be planning on putting it in, if you haven’t already.

Databases too

In fact, many people do the same thing with databases.
The file partitions on which the database server actually stores its data are on the SAN.

Think about that for a second.

All the data in messages flowing through your queues, and the data in the database, on a SAN. This gives you the ability to do a fully consistent backup of the entire system with SAN snapshots, not to mention ship those to your disaster recovery site.

In closing

Distributed solutions are often misunderstood.
Bad experiences in the past with MSMQ can color perceptions in the present.

The thing is that today’s infrastructure is set up to handle distributed solutions much better.
Developers no longer have to turn to centralized broker or database technologies to get the centralized backup and restore capabilities administrators look for.

If you’ve been avoiding NServiceBus for these reasons, give it a try. Not only will it make your life as a developer easier, combined with this virtualization thing, it will make your administrators’ lives easier too.

A CQRS Journey – with and without Microsoft

udidahan — Thu, 29 Mar 2012 13:33:09 +0000

Update – clarification post here.

I was on a call recently with the Advisory Board for the Microsoft Patterns & Practices (P&P) CQRS Journey project where they were showing the current state of their development. Towards the middle of the call, I mentioned that I found there to be too many concerns in one place and that I had expected there to be a division into multiple sub-domains/bounded contexts/business components (BCs). The answer was that they hadn’t gotten to the other areas yet and that’s why at that point in time there was only one BC.

The conversation got a bit derailed at that point, and I was asked how I would do it (though not quite as politely), ultimately leading to my tweeting this:

MS P&P CQRS project asked me to show how I would do the conf mgmt domain My Way. Anyone want to help me show them how to do it right?

— UdiDahan (@UdiDahan) March 21, 2012

I think I got over 50 people who wanted in on this, while some of them urged me to work with P&P rather than separately. I think I’ll do both, hopefully resulting in two implementations that can be compared – one based on Azure (done by P&P) and the other based on NServiceBus (done by my guys). Who do you think is more worried?

But first things first

The fundamental flaw that I see happening with many software projects (including the P&P CQRS effort) is that not enough time is spent to understand the underlying business objectives – the thinking behind the use cases / user stories. Developers assume behavior is “like” that of another/similar domain – when the differences in the details matter a lot. That often leads to software boundaries that aren’t properly aligned with those of the business.

The effects of this lack of alignment may be felt only much later in the project, when we get a requirement that just doesn’t fit the architecture we’ve set up. I blogged about the symptoms of this problem about 2 years ago in my post Non-functional architectural woes.

We need to get into the nitty-gritty of our problem domain to find out what makes it special.

Not all e-commerce is equal

Anytime somebody is going to make a purchase online, developers immediately create some kind of “order” entity with a bunch of “order lines”, just like they read about in all the blog posts and books. Then, all sorts of other behavior are shoe-horned around those entities and… voila, a working system.

The domain of conferences is different – we don’t actually ship products when people register so payment concerns are very different. If our company is purchasing 5 tickets to the conference, the number of people (and which specific people) that eventually go to the conference may be very different than the people we had originally registered – there doesn’t tend to be that kind of volatility in traditional B2C retail (like selling books to people online).

It’s also quite likely that if a company is sending many people to a conference that they wouldn’t be paying by credit card – invoicing and payment may happen much later. That is no reason to block registration from completing.

Not all registration systems are equal

I understand how people can look at systems like TicketMaster and use that as a model for this system but, once again, the differences in the domain matter.

First of all, most people don’t purchase movie tickets weeks in advance – conference tickets do go on sale that far in advance. Second, if the movie you want to go to is sold out this week, no big deal, you’ll see it next week – conferences are more of a one-time/yearly deal. Third, you usually go to the movies with family/friends – if you can’t get tickets for everyone, you’ll go next week. When it comes to conferences, there is no “next week”, so whoever can go, does. Also, attendees going to a conference together are usually coworkers, not family – there are fewer qualms about leaving someone behind.

This is already leading us to a model where we should not view a group registration as a single success or failure affair. This will have an impact on the commands, events, and transactions that flow through our system.

In any case where people are reserving something far in advance, there is a high likelihood of cancellations. This is similar to the domain of hotels/hospitality where you can cancel your reservation up to N days before your arrival at no charge. This also tends to influence the payment structure – we’d rather not have to return people’s money as there can be per-transaction charges for that, instead delaying payment can make sense.

Similar to how hotels overbook by a certain amount (to offset cancellations), our conference might look at doing something similar. The difference is that in the case of a hotel, the guest will likely just book a room in a different hotel in the case the first hotel was fully booked. This probably won’t happen with a conference.

For that reason, we want to remember who wanted to come to our conference even when we thought we were full. You see, our best chance of filling a seat that opened up due to a cancellation is by a person who wanted to register before. What we need here is a waiting list – something that doesn’t make the same kind of sense for hotels or airlines (although airlines do use waiting lists, just that that is usually exposed to travel agents and not to travelers booking online).

First-come, first-served – fairness

The traditional developer thinking about systems is rooted in synchronous and sequential processes. In attempting to give a good user experience, developers want to give the user final confirmation as quickly as possible – whether that’s success or failure.

This results in a first-come, first-served user interaction model – whichever user registers in our conference management system first, the better the chance they’ll get what they want. That sounds like a pretty fair system, the only thing is that fairness was not a requirement.

In the real world, if people are standing in line for tickets, they’d get really upset if the tellers decided somewhat arbitrarily to serve people in the back of the line before those in the front. The great thing about online systems is that nobody can see the “virtual line” – the system can be as unfair as we like and there isn’t a real way for the users to know that this is happening.

Why be unfair?

While conferences, theaters, and airlines all want to have all seats filled, the difference between the ongoing models of airlines and theaters and the once-a-year model of the conference influences how sales are done. Some companies send a lot of employees to our conference so we want to give them preference in registration. This is the area that we have the most leverage over – when it comes to the masses who arrive in ones and twos, there’s not very much we can do. It makes sense to bend over backwards for a large group, but not for a small one. A commitment from a large company tends to mean more than that from a small one.

If Boeing has already registered 70 people to your conference and now wants to send 5 more, are you really going to tell them “sorry, we’re fully booked”, or are you going to do everything in your power to keep them happy so that next year they’ll want to keep working with you? Wouldn’t it be nice if you could “unregister” some people to make room for the Boeing guys?

Now, you can’t necessarily do this up until the last minute, but potentially 2 weeks (or whatever) before the event could be reasonable, leaving people the ability to cancel flights and hotels without charges (assuming we tell them during registration that they should buy refundable plane tickets).

The easiest way to “unregister” someone is to not tell them that their registration was confirmed. In short, 2 weeks before the start of the event we finalize all registrations deciding (based on our internal priority) who gets in and who doesn’t. We may have logic that decides to immediately finalize registration from Boeing (and other select customers) without waiting until 2 weeks before the event.
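Here is one way the finalization step might look. This is a sketch under assumptions: the priority rule (companies sending more people go first, ties broken by arrival order) is invented for illustration, and the real rule is a business decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    attendee: str
    company: str
    sequence: int  # order of arrival


def finalize(registrations, capacity, company_headcount):
    """Confirm up to `capacity` people, favouring companies that send more attendees.

    Ties fall back to arrival order. Companies missing from
    `company_headcount` are treated as sending one person.
    """
    ranked = sorted(
        registrations,
        key=lambda r: (-company_headcount.get(r.company, 1), r.sequence),
    )
    return ranked[:capacity], ranked[capacity:]


pending = [
    Registration("Ann", "SoloCo", 1),
    Registration("Bob", "TinyCo", 2),
    Registration("Cat", "Boeing", 3),
    Registration("Dan", "Boeing", 4),
]
confirmed, not_confirmed = finalize(pending, capacity=3, company_headcount={"Boeing": 72})

print([r.attendee for r in confirmed])
print([r.attendee for r in not_confirmed])
```

Nobody in the second list was ever told they were in, so there is nothing to take back.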

Just don’t look TOO unfair

Appearance is everything. Perception matters. You don’t want to get a reputation for being unfair.

So when we open registration, we can allow the first N people to bypass our waiting list and get accepted right away (payment still needing to be handled later). At that point, you can start moving new registrations through the waiting list.
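The intake side can be sketched just as simply. The class and parameter names below are hypothetical, and payment is left out entirely since it is handled later.

```python
class RegistrationIntake:
    """Accept the first N registrants outright; everyone after that is queued."""

    def __init__(self, immediate_slots):
        self.immediate_slots = immediate_slots
        self.accepted = []
        self.waiting_list = []

    def register(self, attendee):
        if len(self.accepted) < self.immediate_slots:
            self.accepted.append(attendee)
            return "accepted"
        # The conference may not really be full yet; the registrant can't tell.
        self.waiting_list.append(attendee)
        return "waiting list"


intake = RegistrationIntake(immediate_slots=2)
results = [intake.register(name) for name in ("Ann", "Bob", "Cat")]
print(results)
```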

The thing is, nobody knows that you aren’t actually full at that point.

Influences on architecture

I hope you’re getting the impression that this collection of scenarios is going to have a big impact on the design. It indicates to us which parts of the business need to be 100% consistent with each other and which parts can be eventually consistent – ultimately defining where one bounded context stops and another begins. This has a direct impact on the events that we’d end up with – who would publish what, and how many others would subscribe to it.

I know some people will look at the above scenarios and say “but what if the requirements were different?”. The thing is that not all requirements are created equal. In working with our business stakeholders, we need to identify which elements are stable and which are potentially volatile and, yes, that’ll be different in each project. We want to align the main boundaries of our software with the stable business elements.

And don’t even try to create a system so flexible that it could handle any new requirement without any architectural changes – down that path lies madness. User-defined custom fields used in user-defined custom workflows, all of it appearing in reports with sorting, filtering, and grouping. You might as well give your users Visual Studio.

Back to P&P

I don’t know if P&P will adopt this set of requirements for their CQRS Journey. The thing here is that we can see the collaborative nature of the domain quite clearly – multiple actors working in parallel where the decisions of one affect the outcomes of another.

The requirements that I’ve seen being handled in the CQRS Journey so far don’t seem complicated enough to justify anything more than a 2-tier architecture – it’s feeling somewhat over-engineered right now. I know that people in the community see other benefits to CQRS but I’ll have to put up a separate blog post describing why there are other better solutions than CQRS most of the time.

Anyway, I’m willing to see how things progress and tweak these requirements (up to a point) so that both the NServiceBus solution and the Azure solution are addressing the same problem.

In closing

Occasionally I hear people still raising the agile mantra against Big Design/Requirements Up Front. The thing is that the Agile Manifesto never said to intentionally bury your head in the sand with regards to the purpose of the system. It was a push-back against spending months in analysis without anything but documents coming out, but the goal was to reach a middle ground. Nobody ever said “no design up-front” or “no requirements up front”.

I’m going to try to work with both P&P and the alumni of my Advanced Distributed Systems Design course to come up with the simplest possible solution that addresses the requirements (functional and non-functional).

Hope you’ll find this journey interesting.

Update – clarification post here.

Don’t try to model the real world, it doesn’t exist.

udidahan — Mon, 05 Mar 2012 14:22:42 +0000

Recently I’ve started talking more about modeling and its relation to the real world.

Here’s where it all starts from:

Don’t try to model the real world, it doesn’t exist.

I know that that sounds like a very Matrix-y kind of statement, so let me explain.

The “Real” World

The problem with the “real” world is that you are limited by the laws of physics. The thing is that somewhere along the history of software development, we got this idea that if only the structure of our software represented physical reality, then our software would be maintainable, flexible, robust, … in short, good.

The thing is that a single physical entity can have multiple meanings to various stakeholders.

Let’s look at something simple, like a glass:

From a developer’s perspective we might call that a Product and not think very much more of it. We’d be happy that we could come up with a single abstraction that allowed us to model all the different kinds of products the same way.

Yet, in talking with our business stakeholders, one might call it inventory, another might call it a liability (think of breakage requiring insurance), and another call it merchandise. The important thing to note is that the data relevant to each of those meanings is so different from one stakeholder to another.

And that brings me to “customer”

One of my least favorite entities – a lingering symptom of the Northwind disease.

When someone walks into your store for the first time (whether that store is physical or virtual), are they a customer? Even if they haven’t ever bought anything? Even if you don’t know their name? Are they even a User then? I mean, it’s not like we’re going to force people to login (or create an account) just to browse the site, right? A term like Visitor, Prospect, or Lead sounds like it would describe this type of concept better.

After wandering around your store for a while, they come up to you and ask for help finding something. If this pattern repeated itself over and over again for the same category of item, would that be meaningful to the business? Don’t you think that should be modeled? I hope your answers are yes, and yes. This is the domain of merchandising, and seems more related to Visitors than to Customers.

Let’s say you’re selling to other businesses rather than to consumers. In the B2B space, it is common not to receive payment for goods or services for some time – you might have heard terms like Net30, which means you will be paid up to 30 days later (in some cases, this may be from the end of the month of the invoice rather than the date of the invoice).

If you talk to the business folks in charge of these scenarios, you’ll hear them talk about Accounts Payable and Accounts Receivable. Yep, they are the accountants. If you were to go about building a DDD Ubiquitous Language, it sounds like the term Account would be a better choice than Customer. The thing is that accountants use the same language regardless of how quickly an account is settled – like if payment is done by credit card at the time of purchase.

There is no Customer.
There is no Product.

The same goes for so many other problem domains.

I know it feels counter-intuitive to not have a single class representing a single physical thing. It feels like it’s the exact opposite of Domain-Driven Design. It feels anti-object-oriented. But remember, most stakeholders you talk to don’t focus on the physical elements either.

The one thing left to be modeled from “reality”

And that’s identity.

It would be most accurate to say that the physical thing you perceive is nothing more than identity serving to correlate all the separate business concerns to each other. It’s this ID that ties the Visitor on the site, to the Account, to the Addressee (for shipment).

These IDs are needed primarily for reporting and UI reasons – a business action is unlikely to operate on entities correlated this way in the same transaction.
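To make that a bit more concrete, here is a minimal sketch in Python of what "nothing but a shared ID" might look like. The class names come from the examples above, but every field name and the report_row helper are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from uuid import UUID, uuid4

# Each class belongs to a different business concern and holds only the
# data that concern cares about. All field names are hypothetical.

@dataclass
class Visitor:  # merchandising
    id: UUID
    categories_asked_about: List[str]

@dataclass
class Account:  # accounting
    id: UUID
    payment_terms_days: int
    balance_due: float

@dataclass
class Addressee:  # shipping
    id: UUID
    street: str
    city: str

def report_row(
    shared_id: UUID,
    visitors: Dict[UUID, Visitor],
    accounts: Dict[UUID, Account],
    addressees: Dict[UUID, Addressee],
) -> Dict[str, Optional[object]]:
    """Correlate the separate records by ID for a read-only report or UI.

    Nothing here changes state: the ID is only used to line things up.
    """
    return {
        "id": shared_id,
        "visitor": visitors.get(shared_id),
        "account": accounts.get(shared_id),
        "addressee": addressees.get(shared_id),
    }
```

Note that there is no Customer class holding all three. Each concern can store and change its own data without the others, and the only place they meet is a read-only view.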

Nouns, Verbs, and Reality

In building your ubiquitous language, look past the nouns and verbs visible on the surface.

Watch out for statements like “in reality…” and “in the real world…” as they are really just one person’s interpretation of their perception of reality. Not one of us is able to see reality clearly – it’s all just perceptions. Recognize that, like models, all perceptions are wrong, but some may be useful.

Model the perceptions – at least you can have first hand experience of those.

Forget about reality – all that exists is perceptions.

In closing

Transcend the physical.

In software there is no gravity, no mass, and as many dimensions as you choose to create.

Break free of the Matrix.

You are the god of your software.

Common CQRS Abuses

udidahan — Sun, 26 Feb 2012 16:23:34 +0000

Abuse #1

“I’m using CQRS because I need to scale.”

While CQRS may be more scalable than other more traditional architectures, the use of asynchronous communication often complicates the user interaction model causing users to not see the changes they made to data in the UI until later. Trying to compensate for this (by writing even more code) digs one deeper into the complexity hole.

When I point to non-collaborative subdomains and state “You don’t need CQRS for that”, the reason is that in these areas you don’t tend to have much read/write contention. While multiple users/actors may be working in parallel, they don’t touch the same set of data (or do so only very rarely).

In these environments, all you need is a scalable data storage technology – something designed to scale-out (unlike most relational databases). This can take the form of NoSQL databases like HBase and Cassandra. Often all you need is for the UI to query that store directly and show the results, and the same goes for persisting the data back – possibly with some basic validation and calculation code on the side.

No commands, events, DTOs, publish/subscribe, domain model, etc.

As Ayende says – JFHCI, just f-ing hard code it.

You’d be surprised how much of your data this approach can apply to.

With the time you save on all the less important stuff, you’ll have more time to apply CQRS the right way for the high-value/high-complexity parts of your system.

***

Just a final note, as registration for my course in New York is coming to a close in 2 weeks, I wanted to let you all know that the price for the course will be going up this April, after the course in Sydney. The reason for this is that the courses I run myself (at the current rate) have been cannibalizing attendees from the partner companies I do the course with.

I’ll be providing significant discounts to independent consultants (and others paying their own way) to try to keep things fair. Hope to see you there.

Go to the registration page.

Udi & Greg Reach CQRS Agreement

udidahan — Fri, 10 Feb 2012 21:01:32 +0000

Hard to believe, isn’t it?

Although both Greg and I have been saying (quite publicly) for a long time now that we’re in agreement on about 99% of the DDD/CQRS content we talk about, it turns out the terminology we use has made it very difficult for everybody else to see that.

Anyway, on a recent call with Greg and the Microsoft Patterns & Practices team working on the CQRS guidance, I think we finally ironed out the terminological differences.

First of all, both of us clearly stated that CQRS is not meant to be the top-level architecture of a system.

The use of Bounded Contexts from Domain Driven Design is a good way to *start* handling that top-level.

The area of some contention was how big a Bounded Context should be. After going back and forth a bit, Greg brought the concept of Business Component into the conversation, and that really cleared things up all around. I was quite pleased as I’ve been going on and on about these business components for years (I think 2006 was one of my earlier posts on the topic, though the mp3 has disappeared since then).

Anyway, here’s the meat:

A given Bounded Context should be divided into Business Components, where these Business Components have full UI through DB code, and are ultimately put together in composite UIs and other physical pipelines to fulfill the system’s functionality.

A Business Component can exist in only one Bounded Context.

CQRS, if it is to be used at all, should be used within a Business Component.

There you have it – terminological agreement in addition to the philosophical agreement that was always there.

You can find the history of my posts mentioning Business Components here.

The Myth Of “Infinite Scalability”

udidahan — Thu, 29 Dec 2011 09:58:52 +0000

Scalability is a topic near and dear to my heart.

Many a client seeks me out for the first time for help in this area.

Usually the request is for an amount substantially smaller than infinity.

It’s usually on the discussion groups and in conference presentations that infinity is brought into it.

The basics

The first issue with scalability is the use of the word as an adjective: scalable.

“Is the system scalable?”

Or the similar verb use: “Does it scale?”

The problem here is the implication that there is a yes/no answer to the question.

Scalability is not boolean.

Linear Scalability

When people talk about scalability, or a system being able to scale, they’re usually referring to a graph that looks something like this:

The red graph indicates a system that does not scale well; the green graph indicates one that does.

What is missing from this diagram are the labels of the axes.

The Y axis is Cost, Expense, or Money.
The X axis is usually the number of users (for internet-type companies).

Ultimately, scalability is a cost-function that will tell us how much it will cost to have the system support a certain number of users.

Linear scalability is when the cost of the next user is the same as the cost of the previous user. This means our system doesn’t have bottlenecks. This is what people usually mean when they say “infinite scalability”.
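As a rough illustration of that definition, here is a small Python sketch that checks whether a cost function is linear by comparing the cost of each additional user. The two cost functions and their coefficients are invented for the example; they are not measurements of any real system.

```python
def marginal_costs(cost, max_users):
    """Cost of adding each successive user, for users 1..max_users."""
    return [cost(n) - cost(n - 1) for n in range(1, max_users + 1)]

def is_linear(cost, max_users, tolerance=1e-9):
    """True if every additional user costs the same as the first one."""
    margins = marginal_costs(cost, max_users)
    return all(abs(m - margins[0]) <= tolerance for m in margins)

# Hypothetical cost functions, made up for illustration.
def linear_cost(users):
    return 2.0 * users  # every user costs the same

def bottlenecked_cost(users):
    return 2.0 * users + 0.01 * users ** 2  # each new user costs a bit more
```

With these, is_linear(linear_cost, 1000) holds and is_linear(bottlenecked_cost, 1000) does not, because the quadratic term makes every additional user more expensive than the last.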

But there’s more

As many of the internet companies (and their investors) have realized over the years, there’s a difference between the number of users and the number of active users. It’s very easy to scale to a billion users when only 1000 of them are active at any given time.

To be more accurate, what we want is additional X-axes for things like total data managed by the system, number of requests per user, resource utilization per request, propagation speed (how quickly information entered by one user needs to be visible to others), and more.

Scalability is a multi-dimensional cost function, where part of an architect’s job is to figure out which dimensions are significant for the system/business, and what the expectation for growth is across each axis.
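One way to picture this is a cost function that takes several dimensions at once, plus a helper that shows what each dimension's expected growth would add to the bill on its own. This is only a sketch: the Load fields, the coefficients in monthly_cost, and the growth factors are all assumptions chosen to keep the arithmetic simple.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Load:
    active_users: int
    requests_per_user: float
    data_gb: float

def monthly_cost(load):
    """Hypothetical cost model; the coefficients are made up."""
    request_cost = 0.0001 * load.active_users * load.requests_per_user
    storage_cost = 0.02 * load.data_gb
    return request_cost + storage_cost

def cost_growth(load, growth):
    """Extra cost when one dimension at a time grows by its expected factor.

    growth maps a Load field name to a multiplier, e.g. {"data_gb": 2}.
    """
    base = monthly_cost(load)
    extra = {}
    for name, factor in growth.items():
        grown = replace(load, **{name: getattr(load, name) * factor})
        extra[name] = monthly_cost(grown) - base
    return extra
```

Calling cost_growth(Load(1000, 50, 500), {"active_users": 10, "data_gb": 2}) lets you compare the axes side by side. In this made-up model the tenfold growth in active users adds far more cost than doubling the data, so that is the dimension worth worrying about first.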

Preparing for “infinity”

Be careful not to optimize for only a single dimension – reality is a whole lot more complex.

There are so many other things to deal with as a system scales.

For example, do you really think you’re going to want your configuration entirely centralized? Putting everything in one place means easier management, yes, but it also means a mistake will instantly affect everyone. Is it worth the risk? Maybe instead of centralization, we could do with some automation that will allow a staged rollout of configuration changes with the ability to roll back.
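Here is a minimal sketch in Python of what such automation might look like: push the new configuration to a growing fraction of nodes, check health after each stage, and restore the previous configuration everywhere if anything looks wrong. The stage fractions, the is_healthy callback, and the in-memory dict of nodes are all assumptions for illustration; a real rollout tool would have to deal with much more.

```python
def staged_rollout(nodes, new_config, is_healthy, stages=(0.1, 0.5, 1.0)):
    """Apply new_config to growing fractions of nodes; roll back on failure.

    nodes: dict of node name -> current config (mutated in place)
    is_healthy: callable(node_name, config) -> bool (hypothetical check)
    Returns True if the rollout completed, False if it was rolled back.
    """
    previous = dict(nodes)  # snapshot so we can restore everything
    names = sorted(nodes)
    done = 0
    for fraction in stages:
        # Always touch at least one node per stage.
        target = max(1, int(len(names) * fraction))
        for name in names[done:target]:
            nodes[name] = new_config
        done = max(done, target)
        # Check every node updated so far, not just the latest batch.
        if not all(is_healthy(n, nodes[n]) for n in names[:done]):
            nodes.update(previous)  # roll back every node we touched
            return False
    return True
```

The point of the stages is that a bad configuration only ever reaches the nodes updated so far, rather than instantly affecting everyone.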

The same goes for rolling out new versions, patches, and upgrades.

But that now means we may have multiple versions of the same system in production at the same time. How will that work? Will they all talk to the same database? How will we version the database then? If not, how will we handle state? Won’t this mean our code will have to be backwards compatible from one version to another? Isn’t that hard? Like, insanely hard?

Please, can we park the whole “infinite scalability” thing?
It’s really not the most important concern – not by a long shot.

Recording of joint interview with Eric Evans

udidahan — Thu, 01 Dec 2011 04:59:14 +0000

Last month both Eric Evans and I spoke at a conference run by the International Association of Software Architects (IASA) in Madrid. Eric talked about DDD and I talked about CQRS. While the talks were recorded, I don’t think they’ve come online yet.

At the end of the conference, we were interviewed by the local .NET magazine dNM and that video is now available here. We covered the background on things like DDD, CQRS, and the Cloud. I don’t think that either of us said anything earth-shattering but if you have half an hour, take a look: