Sunday, August 17, 2014

Aum Cluster or Cloud?

Disambiguation and some FAQ

What is a cloud-system? How does it relate to clustering? What about server virtualization?

There are many definitions, each bringing its own little flavor and leaving an aftertaste of confusion. If you inoculate yourself against correctness angst, trendy words and "term juggling", become nonchalant and just look at the facts, you will realize that this is an entanglement of completely orthogonal ontologies.

Virtual Servers

First, let's get virtualization out of the way. One can build a cloud/cluster/large system without any virtualization whatsoever. The confusion comes from the fact that many service providers these days offer "clouds" as dynamic sets of "virtual" (not real) logical computers that customers can create/delete/start/stop. Those computers are not real hardware; they are, one way or another, emulated as if they were real. Of course those "logical" computers run on some physical machines, but this is transparent to the customer. The point is: virtualization by itself does not make your application a "cloud" application. Your app must be created for the cloud regardless of virtualization.

Virtualization is good for cloud apps because it lets you dynamically increase/decrease your server usage by adding/removing boxes as you need them, so you can better utilize your hardware. BUT if your app is not created to adjust its participating nodes/servers dynamically (at runtime), the benefits of virtualization are diminished.
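To make "adjusting participating nodes at runtime" a bit more concrete, here is a minimal sketch of a node pool whose membership can change while the app keeps dispatching work. The `NodePool` class, its method names and the host names are all made up for illustration; this is not Aum Cluster or NFX code, just one simple way such a thing could look.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration: a set of worker nodes that may grow or shrink at runtime.
public sealed class NodePool
{
    private readonly object m_Lock = new object();
    private readonly List<string> m_Nodes = new List<string>();
    private int m_Next;

    public int Count { get { lock (m_Lock) return m_Nodes.Count; } }

    // Called when a box (virtual or physical) joins the system.
    public void Add(string host)
    {
        lock (m_Lock)
            if (!m_Nodes.Contains(host)) m_Nodes.Add(host);
    }

    // Called when a box leaves; returns false if the host was not registered.
    public bool Remove(string host)
    {
        lock (m_Lock) return m_Nodes.Remove(host);
    }

    // Picks the next node in round-robin order from whatever nodes exist right now.
    public string Next()
    {
        lock (m_Lock)
        {
            if (m_Nodes.Count == 0)
                throw new InvalidOperationException("No nodes available");
            m_Next = m_Next % m_Nodes.Count;
            return m_Nodes[m_Next++];
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var pool = new NodePool();
        pool.Add("box-a");
        pool.Add("box-b");
        Console.WriteLine(pool.Next());
        pool.Add("box-c");      // capacity added while the app is running
        pool.Remove("box-a");   // capacity removed while the app is running
        Console.WriteLine(pool.Next());
    }
}
```

An app written against a fixed list of servers has nothing like `Add`/`Remove` to call when boxes come and go, which is exactly where the benefit of virtualization gets lost.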

What are "Clouds"?

A cloud is just an abstraction of "somewhere on the internet". A cloud-system is a system that runs on some servers in some data centers, NOT on your laptop/tablet/phone. Local devices of course interface with clouds, but they do not store your data - less headache! Contrary to what many believe, clouds do not have to be publicly accessible (the way Facebook or Twitter are); indeed, many corporations have built their own internal clouds that only internal resources/devices/employees can connect to (e.g. via VPN).

Server Clusters?

As demands grow, systems end up employing many servers to do some job, e.g. serving database requests or building web pages. A cluster is a set of machines that appear as a single logical system that performs some specific task. The machines are usually tightly connected within a data center, although a cluster may even span multiple geographically separate data centers.

Am I in the cloud now?

Simple. If all of your personal PCs/tablets/gadgets get fried today, will you lose the data/software in question? If yes, then you are not in the modern cloud. Modern cloud systems give you this benefit: just remember your ID and password, and you can continue where you left off from any machine/point in the world. This rule is for general apps that are usually web-based. Of course there are special kinds of apps (like 3D games) that would require you to re-install something on the new computer, but you would still recover all of your "state" and continue where you left off before your devices got lost.

Do I need to use clusters to be in the cloud?

Most likely your cloud system does consist of some form of cluster software/hardware. But the answer is NO. One may create a cloud service out of many disjoint computers (that someone else may call a "cluster").

Do I need to use virtual servers to build clouds?

Absolutely not. Any cloud service can be created without a single virtual computer.

What are "cloud-apps"?

These are applications engineered to run in the cloud. Usually these are systems that know how to deal with a myriad of problems that do not exist in "regular"/local apps. For example, in cloud clusters there are many servers to deal with: how does the app get configuration/connect strings to other members of the cloud? There are 100s of questions that cloud apps need to address that local (or small client/server) apps don't care about.

Can I AUTO-convert (without spending time) my existing client/server DB app into a "cloud-app"?

If you still expect to have 5-10 active users then yes, there is no need to convert. Just host your current client-server app on something like Amazon, and nothing needs to be changed (except for some config files). On the other hand, it is not going to be what guys like Google, Facebook, Twitter call a "cloud app". There is no way to auto-convert your client-server application into a scalable web service that serves 1,000,000 customers a day. You need an absolutely different architecture for that.

To Summarize: Cloud systems are in the cloud (literally somewhere else). Clustering is just a way of sticking many computers together (either physically or logically). Cloud services usually consist of software and hardware clusters of all sorts. They run applications that were engineered with all the crazy cloud system nuances in mind (and cost a lot of $$$ :( ). And finally, virtualization is not a necessary (although convenient for some) requirement to be in the cloud.

Aum Cluster

"Aum Cluster" is a software library/framework for the creation of massive general-purpose computer clusters that may be used to build public/private cloud-based applications. "General-purpose cluster" here means many computers that perform app-dependent tasks, unlike, for example, Oracle Database Cluster, which is, strictly speaking, just a name for Oracle's database product. Aum Cluster is a library, which means you build what you want with it, be it a particle physics simulation or an online e-commerce site.

The purpose of Aum Cluster is to address 100s of very complex software problems that arise in distributed systems, so that its users can concentrate on business-specific tasks. For example, things like configuration of 1,000,000s of servers, discovery, peer name resolution, unique ID generation, replicated data stores, process management and remote control, and security are all factored in.

What sets Aum Cluster apart from many "cloud systems" is the Unistack approach. Unistack is a unified software library that gets deployed to all participating servers, thus reducing the complexity 10-fold. I have blogged about it before.

Aum Cluster can run on either virtual or physical servers. Virtualization has no real significance when you write your app.

To Summarize: Aum Cluster framework allows you to properly architect and build huge systems (with millions of nodes) taking care of 100s of complex issues that exist in any distributed system. It is like Google/Facebook/Twitter internal mechanisms made available to any application in a general way.

Saturday, August 2, 2014

NFX.Glue - Interprocess Communication

Definition + Features

NFX.Glue is the part of the NFX framework that lets developers quickly (much faster than with WCF/RMI or remoting) interconnect, or "glue together", various process instances. In one sentence: NFX.Glue is a contract-based, stateless or stateful RPC mechanism that uses messages as the logical delivery unit. The core implementation of Glue is probably less than 10,000 LOC (very usual for NFX) and adds roughly 300 KB to the final assembly image.

NFX.Glue Features

  • Very Simple - to use and configure
  • Built into the NFX application container, so it can be hosted in any app type without special "service hosts"
  • Contract-based programming
  • Injectable binding types define protocol/message exchange patterns (e.g. sync blocking/async/multicast etc.)
  • Pre-implemented native bindings: TCP sync, TCP async, In-process
  • Native bindings allow for transparent serialization, no need for special attributes (unlike WCF or ProtoBuf), supports objects of any complexity with cyclical references
  • Message-based. Every call turns into RequestMsg, server generates ResponseMsg for two-way calls
  • Supports MessageHeaders for extra data (e.g. security credentials)
  • Supports one-way or two-way calls
  • Supports multilevel message filtering/inspection (glue/client/binding)
  • Supports security - guard contracts/methods/classes with permission attributes
  • Supports stateless or stateful server programming with volatile process lifecycle (allows a process to restart without "forgetting" its state)
  • Proxy Clients natively provide sync and async call trampolines without any extra threads or wait queues/reactors
  • Built-in channel/transport lifecycle management - impose limits on the number of outgoing connections per host, on how long to keep idle channels alive, etc.
  • Detailed statistics - number of messages/bytes/calls, call round-trip times per contract/method
  • Performance on a 6 core machine: ~120,000 ops/sec two-way simple calls (return int as string+'hello!') via native TCP sync binding
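The "guard contracts/methods/classes with permission attributes" item from the list above can be pictured with a small reflection-based sketch. Everything here is an assumption made for illustration: `RequirePermissionAttribute`, `GuardedDispatcher`, `IClock` and the permission name are invented, and the actual NFX permission classes are certainly richer than this.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical permission marker; it can sit on a contract or on a single method.
[AttributeUsage(AttributeTargets.Interface | AttributeTargets.Class | AttributeTargets.Method)]
public sealed class RequirePermissionAttribute : Attribute
{
    public RequirePermissionAttribute(string name) { Name = name; }
    public string Name { get; }
}

// Hypothetical contract guarded at the method level.
public interface IClock
{
    [RequirePermission("clock.read")]
    string Time();
}

public sealed class Clock : IClock
{
    public string Time() { return "12:00"; }
}

public static class GuardedDispatcher
{
    // Checks contract-level and method-level permissions before the server method runs.
    public static object Invoke(object server, Type contract, string method,
                                ISet<string> granted, params object[] args)
    {
        MethodInfo mi = contract.GetMethod(method);
        if (mi == null) throw new MissingMethodException(contract.Name, method);

        var required = new List<RequirePermissionAttribute>();
        required.AddRange(contract.GetCustomAttributes<RequirePermissionAttribute>());
        required.AddRange(mi.GetCustomAttributes<RequirePermissionAttribute>());

        foreach (var permission in required)
            if (!granted.Contains(permission.Name))
                throw new UnauthorizedAccessException("Missing permission: " + permission.Name);

        return mi.Invoke(server, args);
    }
}

public static class Program
{
    public static void Main()
    {
        var granted = new HashSet<string> { "clock.read" };
        Console.WriteLine(GuardedDispatcher.Invoke(new Clock(), typeof(IClock), "Time", granted));
    }
}
```

The point of the sketch is only the placement of the check: the caller's granted permissions are compared against the attributes before the method body is ever entered, so the implementation itself stays free of security code.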

How NFX.Glue Works

A call is originated from a calling party, like so:
   var node = new Node("async://quad:7311"); 
   var console = new RemoteTerminalClient( node );
   console.Connect("Jack Lowery");

   Console.WriteLine("The time on connected node is: " + console.Execute("time"));

   console.Disconnect();
Here, we have connected to machine "quad" using "async" for binding. The calling process has a piece of config that says:
 glue
 {
  bindings
  {
   binding {name="async" type="NFX.Glue.Native.MpxBinding, NFX"}
  }
 } 
So now, the Glue runtime knows that "async" is an instance of "NFX.Glue.Native.MpxBinding, NFX" (with about a dozen parameters like TCP buffer windows etc.). The original contract for the service is this:
    /// <summary>
    /// Represents a contract for working with remote entities using terminal/command approach
    /// </summary>
    [Glued]
    [AuthenticationSupport]
    [RemoteTerminalOperatorPermission]
    [LifeCycle(ServerInstanceMode.Stateful, SysConsts.REMOTE_TERMINAL_TIMEOUT_MS)]
    public interface IRemoteTerminal
    {
        [Constructor]
        RemoteTerminalInfo Connect(string who);

        string Execute(string command);

        [Destructor]
        string Disconnect();
    }
It is a stateful contract that initializes a server instance (a terminal connection, in our case) with a call to "Connect", and the instance then either times out after "REMOTE_TERMINAL_TIMEOUT_MS" or gets torn down by a call to the method marked "[Destructor]" ("Disconnect" here). In this semantic, a constructor/destructor is just a special kind of method that does regular method work, possibly returning some values, but that also tells Glue what to do with the instance. The "LifeCycle" is a part of the contract, not the implementation, because it really dictates what other methods a contract should or should not have. Pay attention to "RemoteTerminalOperatorPermission", which guards ALL methods of this contract. A user must supply a valid token; this is why "AuthenticationSupport" is stipulated.
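The lifecycle described above (create on the constructor call, expire after a timeout, tear down on the destructor call) can be sketched roughly as follows. This is an assumption-laden illustration, not how Glue is implemented: `InstanceRegistry`, `Create`, `Touch` and `Destroy` are hypothetical names, and the clock is passed in explicitly just to keep the example deterministic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of server-side bookkeeping for stateful contract instances.
public sealed class InstanceRegistry<T> where T : class
{
    private sealed class Entry
    {
        public T Instance;
        public DateTime LastUsedUtc;
    }

    private readonly object m_Lock = new object();
    private readonly Dictionary<Guid, Entry> m_Instances = new Dictionary<Guid, Entry>();
    private readonly TimeSpan m_Timeout;

    public InstanceRegistry(TimeSpan timeout) { m_Timeout = timeout; }

    // A call to a [Constructor]-style method: remember the new instance under a fresh ID.
    public Guid Create(T instance, DateTime nowUtc)
    {
        var id = Guid.NewGuid();
        lock (m_Lock)
            m_Instances[id] = new Entry { Instance = instance, LastUsedUtc = nowUtc };
        return id;
    }

    // A regular call: returns the instance, or null if it is unknown or has timed out.
    public T Touch(Guid id, DateTime nowUtc)
    {
        lock (m_Lock)
        {
            Entry entry;
            if (!m_Instances.TryGetValue(id, out entry)) return null;
            if (nowUtc - entry.LastUsedUtc > m_Timeout)
            {
                m_Instances.Remove(id);   // expired: forget the instance
                return null;
            }
            entry.LastUsedUtc = nowUtc;   // activity extends the instance's life
            return entry.Instance;
        }
    }

    // A call to a [Destructor]-style method: forget the instance right away.
    public bool Destroy(Guid id)
    {
        lock (m_Lock) return m_Instances.Remove(id);
    }
}

public static class Program
{
    public static void Main()
    {
        var registry = new InstanceRegistry<string>(TimeSpan.FromSeconds(30));
        var start = new DateTime(2014, 8, 2, 12, 0, 0, DateTimeKind.Utc);
        Guid id = registry.Create("terminal session", start);
        Console.WriteLine(registry.Touch(id, start.AddSeconds(10)) != null);
        Console.WriteLine(registry.Touch(id, start.AddSeconds(100)) != null);
    }
}
```

In the usage above the second `Touch` comes 90 seconds after the first, which is past the 30-second timeout, so the instance is treated as gone, much like a terminal session that nobody disconnected explicitly.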

On the server we will include in config:


 glue
 {
  servers
  {
   server {name="TerminalAsync" 
           node="async://*:7700"
           contract-servers="ahgov.HGovRemoteTerminal, ahgov"}
  }
 } 
And then implement the interface like so:
    /// <summary>
    /// Provides basic app-management capabilities
    /// </summary>
    [Serializable]
    public class AppRemoteTerminal : IRemoteTerminal
    {
        public AppRemoteTerminal()
        {                                                            
        .....
        }
       
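        // "override" implies a base class that declares Destructor(); that base is not shown in this listing
        protected override void Destructor()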
        {
        .....
        }

        private int m_ID;
        private string m_Name;
        private string m_Who;
        private DateTime m_WhenConnected;
        private DateTime m_WhenInteracted;
        
        
        public virtual RemoteTerminalInfo Connect(string who)
        {
          ..........
        }

        [AppRemoteTerminalPermission]
        public virtual string Execute(string command)
        {
           ....... 
        }

        public virtual string Disconnect()
        {
            return "Good bye!";
        }
     }
Notice the use of instance fields: since the contract is stateful, they hold the state of the server instance between calls.

Let's look at the following diagram:

The call is made in the client code, and then it gets turned into a "RequestMsg". The client transport makes a "CallSlot" - a type of "spirit-less mailbox" (no threads/events) that captures the request with its timestamp and unique GUID. At the end of the call, the server sends a "ResponseMsg" if the call is not OneWay, and the response gets matched by its RequestID to the original "CallSlot".
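The matching of responses to pending requests can be illustrated with a small correlation table. Note that this is not the Glue CallSlot: the sketch swaps in `TaskCompletionSource` as the "mailbox" because it is readily available in .NET, and `DemoRequest`, `DemoResponse` and `SlotTable` are simplified stand-ins invented for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Simplified stand-ins for request/response messages (not the NFX types).
public sealed class DemoRequest
{
    public Guid RequestID;
    public string Method;
    public bool OneWay;
}

public sealed class DemoResponse
{
    public Guid RequestID;
    public object Result;
}

// Hypothetical table of pending calls keyed by request ID.
public sealed class SlotTable
{
    private readonly ConcurrentDictionary<Guid, TaskCompletionSource<DemoResponse>> m_Slots
        = new ConcurrentDictionary<Guid, TaskCompletionSource<DemoResponse>>();

    public int Pending { get { return m_Slots.Count; } }

    // Two-way requests get a slot to wait on; one-way requests complete immediately.
    public Task<DemoResponse> Register(DemoRequest request)
    {
        if (request.OneWay) return Task.FromResult<DemoResponse>(null);
        var slot = new TaskCompletionSource<DemoResponse>();
        m_Slots[request.RequestID] = slot;
        return slot.Task;
    }

    // Called by the receiving side; returns false when no pending call matches the ID.
    public bool Deliver(DemoResponse response)
    {
        TaskCompletionSource<DemoResponse> slot;
        if (!m_Slots.TryRemove(response.RequestID, out slot)) return false;
        return slot.TrySetResult(response);
    }
}

public static class Program
{
    public static void Main()
    {
        var table = new SlotTable();
        var first = new DemoRequest { RequestID = Guid.NewGuid(), Method = "Execute" };
        var second = new DemoRequest { RequestID = Guid.NewGuid(), Method = "Execute" };
        Task<DemoResponse> firstCall = table.Register(first);
        Task<DemoResponse> secondCall = table.Register(second);

        // The response to the second request arrives first; matching is by ID, not by order.
        table.Deliver(new DemoResponse { RequestID = second.RequestID, Result = "second done" });

        Console.WriteLine(secondCall.Result.Result);
        Console.WriteLine(firstCall.IsCompleted);
    }
}
```

Because responses are matched by ID rather than by arrival order, many calls can be pending on one channel at the same time, which is the property a multiplexing transport relies on.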

An interesting part of this design is the "Binding" area. It controls the means of message delivery (e.g. TCP/IP, USB, COM, LPT or anything else) and the message exchange mode: synchronous or asynchronous. In SYNC mode the message gets sent and the response gets delivered in one operation, akin to TCP blocking sockets. In ASYNC mode we use completion ports on Windows to establish a bi-directional traffic channel per every single socket. These implementations are provided in the "NFX.Glue.Native" namespace as "SyncBinding" and "MpxBinding" (MultiplexingBinding). With "MpxBinding", which is asynchronous by definition, sending is orthogonal to receiving, which means that the physical TCP channel IS NOT BLOCKED for the duration of the call execution. For example, suppose the server needs 100 msec to execute some method. One can post 1000s of calls over the same transport via MpxBinding, and the responses will arrive as they get generated by the server. Had we used "SyncBinding" instead, we would have needed as many TCP connections as there are currently pending calls.

However, do not question the need for "SyncBinding". Blocking sockets give much lower call round-trip latency in scenarios where calls are not frequent and not highly parallel. For example, a local machine clock update done every minute via SyncBinding would work much better time-wise than async socket/message IO (a difference of a few milliseconds). So "MpxBinding" is better for throughput with tolerable latencies across many calls (1000s/sec), whereas "SyncBinding" gives better latency for relatively infrequent calls (10s/sec).

A Few Q/A

  • How does this relate to ZeroMQ? - NFX.Glue is a contract-based/object-level message passing system, whereas ZeroMQ is byte-message oriented. NFX.Glue is a much higher-level framework designed to work with higher-level constructs conducive to solving business problems.
  • Is Glue slower than ZeroMQ? - It really depends on what type of "business payload" your app is pushing. The network part of Glue is as fast as ZeroMQ, as it uses basic sockets and avoids buffer copies whenever possible, but please do not compare sending byte[4] with calling a method on a remote class instance.
  • How does Glue relate to Erlang? - The answer is similar to the ZeroMQ one above: Erlang works with much lower-level (than Glue) data primitives - tuples, lists and the like. One cannot really compare the two technologies directly, as building a similar feature set in Erlang would require a significant effort (adding security, permissions, state management). Erlang uses its own communication platform (OTP) very well, however it is still much narrower in scope than NFX.Glue. Take a look at NFX.Erlang instead if you need to support Erlang/OTP from NFX.
  • Does Glue completely replace WCF? - For us YES, 200%. The whole Aum Cluster is based on Glue because all nodes in the cluster are running NFX; this is a benefit of the UNISTACK concept that I described a few weeks back. If you are a corporate SOAP/WSE consumer then NO, Glue does not currently support it with native bindings and never will. One can create bindings for SOAP and other corporate bloat, but there is really no need to pollute a clean NFX library with out-dated crap.
  • How do I expose a Glue contract as JSON/REST? - You'd need to use a JSONHttp binding for that, the one that I have not created and have no intention to create, because it has no practical value. In NFX, REST services are done much more easily with NFX.Wave MVC controllers, which should expose your internal Glue services as a facade. Remember - Glue was never meant to be exposed publicly. It could be, via corresponding bindings, but there is no need to create bindings just to support some standards that will never be used.