Wednesday, December 21, 2016

HADAMS - Hadoop Automated Deployment And Management Stack - Part I


PrathamOS will soon have a Server Edition, "TreeBeard", based on 64-bit hardened CentOS.

Consequently, it is worth combining that effort with support for the Hadoop Stack as well.

In the next few weeks we will explore

HADAMS - Hadoop Automated Deployment And Management Stack

This will build up to the OS release planned next year...

All product names, logos, and brands are property of their respective owners. 
All company, product and service names used in this website are for identification purposes only. 
Use of these names, logos, and brands does not imply endorsement.

I guess the most shocking part of the above stack, for many, is the complete omission of a particular project named Spark.

Yes, that is correct. We won't be using it (at least not given what is on offer right now). Traditional MapReduce is also missing, and it won't be happening here either.

Let us check out the alternatives for Spark's functionality:
  • Spark SQL - Hive with Tez Engine
  • Spark Streaming - Technically, since Spark processes in micro-batches, it is not true "streaming", but let us assume that Spark "streams". We will be using Flink instead.
  • MLlib - Mahout
  • GraphX - Giraph (check out what Facebook has to say about it)
At this point, all I would say is that there is enough content on the web to compare the above choices in terms of functionality and, most importantly, maturity. Also, if at any juncture there arises a need for in-memory computation, we have Ignite in the stack to assist us.

MapReduce will be handled completely through Hive/Pig. No one should be required to learn Java/Scala to get on the BigData train, at least to start off. Another tool of interest could be DMX-h, but we will look at it later.

We need to identify where Hadoop fits into the project requirement cycle. Adopted too early, it is unproductive; in fact, it is counter-productive and expensive, to say the least. So it is very important to understand that Hadoop is not some new coding language or framework that can just be applied to any project simply because we want to try the "new tech on the block".

Let us proceed with capacity definition. The above image showcases some of the vendors providing hardware to host BigData. Of course, there are many others.

Cloud or bare metal, the choice is yours. I, as the author of this blog, will exercise my right to bias and choose LeaseWeb, primarily because they are Platinum Sponsors of the Apache Foundation, the parent organization that keeps people like us in business. Besides, I am not a "cloud" person.

Cloud CPUs are not comparable to data-center CPUs, and cloud clusters may have different scaling properties and behave quite differently for different kinds of workloads. While the cloud is a cheap way to get started fast, it tends to be very expensive per hour. Possibly the biggest opportunity in the cloud is that Amazon's S3 storage is dirt-cheap even compared to HDFS, and modern Hadoop supports it as a storage format. For certain applications, particularly those with a large proportion of cool storage, this can be compelling, but it is usually not advantageous for typical Hadoop workloads because S3 is less nimble for access than DAS.

It is very difficult to ascertain the actual machine cost of running large clusters with conventional storage in the cloud, but one can assume that any cloud cluster, dollar for dollar, will be severely I/O bound, with relatively poor SLAs and high hourly cost.

Another hidden cost of the cloud is lock-in. Writing data into EC2 is free, but the cost of backing out includes $50,000 per petabyte in download charges. The real cost, however, is likely to be the sheer time it takes. Network bandwidth from S3 will probably be in the tens of MB/second, but say you can sustain 100 MB/s. At that rate, it would take 10 million seconds, i.e., 115 days, per PB. Whatever the download rate, one must also include the cost of operating two storage systems for that period of time, etc.
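The egress arithmetic above can be sanity-checked with a few lines of Python. This is just a sketch using the figures quoted in the text (~100 MB/s sustained download and $50,000 per PB in charges); real rates and pricing will vary.

```python
# Back-of-the-envelope check of the cloud egress figures quoted above.
# Assumed inputs (from the text): ~100 MB/s sustained rate, $50,000/PB charges.

PB_IN_MB = 1_000_000_000  # 1 PB expressed in MB (decimal units)

def egress_days(petabytes: float, rate_mb_per_s: float = 100.0) -> float:
    """Days needed to pull `petabytes` out of cloud storage at a sustained rate."""
    seconds = petabytes * PB_IN_MB / rate_mb_per_s
    return seconds / 86_400  # seconds per day

def egress_cost(petabytes: float, usd_per_pb: float = 50_000.0) -> float:
    """Download charges for `petabytes` at the quoted per-PB rate."""
    return petabytes * usd_per_pb

print(int(egress_days(1)))   # ~115 days for 1 PB at 100 MB/s
print(egress_cost(1))        # $50,000 for 1 PB
```

Note how quickly the timescale dominates: even doubling the sustained rate still leaves nearly two months per petabyte.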

The preferred location for a global "deployment" would be their Netherlands data center in Amsterdam.

The server would be the Intel workhorse Xeon E5-2420. Read about the AMD alternative here. Another interesting choice is the AMD Opteron 6272.

At the end of the day, the idea is to get the best returns in terms of data vs. storage, so that we remain true to the concept of "running commodity servers".

Click on the links below to check out the mentioned configuration.

Let us try to understand what "BIG" data really can be through the example provided. For that, we will have to understand the basics and hardware limits first.

Production Cluster Setup Guidelines 
  • Compute-intensive workloads will usually be balanced with fewer disks per server.
    • This gives more CPU for a given amount of disk access.
    • One can use six or eight disks and get more nodes or more capable CPUs.
  •  Data-heavy workloads will take advantage of more disks per server.
    • Data-heavy means more data access per query, as opposed to more total data stored.
    • More disks get more I/O bandwidth, regardless of disk size.
  • Network capacity tends to go up with high-disk density
    • Jobs with a lot of output use higher network bandwidth along with more disk.
    • Output creates three copies of each block; two are across the network.
    • ETL/ELT and sorting are examples of jobs that:
      • Move the entire dataset across the network
      • Have output about the size of the input
  •  Storage-heavy workloads
    • Have relatively little CPU per stored GB
    • Favor large disks for high capacity and low power consumption per stored GB
    • The benefit is abundant space.
    • The drawback is time and network bandwidth to recover from a hardware failure
  • Heterogeneous storage can allow a mix of drives on each machine.
    • Just two archival drives on each node can double the storage capacity of a cluster.
    • The prolonged impact on performance of a disk or node failure has to be addressed though.
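The replication point in the guidelines above (each output block is written three times, and two of those copies cross the network) can be sketched as a quick estimate. The 500 GB job size in the example is purely illustrative.

```python
# Sketch of the replication-traffic guideline above: with 3x replication,
# one copy of each block is written locally and the remaining copies cross
# the network, so a job's output roughly doubles its volume on the wire.

def replication_network_gb(output_gb: float, replication: int = 3) -> float:
    """Network traffic (GB) generated by writing `output_gb` of job output
    to HDFS: one replica stays local, (replication - 1) cross the network."""
    return output_gb * (replication - 1)

# A hypothetical ETL job writing 500 GB of output pushes ~1 TB over the network:
print(replication_network_gb(500))  # 1000.0
```

This is why ETL/ELT and sort-style jobs, whose output is about the size of their input, favor higher network capacity alongside higher disk density.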

As the guidelines above suggest, hardware configuration will depend on, and thereby differ with, the case study. However, let us discuss a generic deployment for "typical" workflows.
  • Normal real-world Balanced Compute Configurations usually have dual hex-core CPUs, 128-192 GB RAM, and 12 directly attached 2 TB disks on the motherboard controller, with bonded 10GbE + 10GbE to the ToR switch. These are often available as twins, with two motherboards and 24 drives in a single 2U cabinet.
  • Light Processing Configuration would have 24-64GB memory, and 8 disk drives (1TB or 2TB).
  • Storage Heavy Configuration has 48-96GB memory, and 16-24 disk drives (2TB – 4TB). This configuration will cause high network traffic in case of multiple node/rack failures.
  • Compute Intensive Configuration normally has 64-512GB memory, and 4-8 disk drives (1TB or 2TB).
  • Other points worth mentioning: the node has to be optimally utilized, i.e. the complete disk and RAM should be available for the "great churn"; and because Hadoop relies on replication, with a recommended base minimum of three nodes, no implementation facilitating fewer than that should be undertaken.
Recommendations (Balanced Computation) (Others Adjusted Accordingly)
  • When to start
    • Disk space available (3Nodes * 24 TB) = 72 TB.
    • Non-HDFS space: 2%. Cluster nodes should not be used for shared services by default, which means that for OS operations, logs, etc., 2% of the total space available should be enough. In this case, that amounts to 480 GB per node. One can also set a fixed quota instead, say 1 TB or so.
    • Processing space: 23%. In a BigData environment, the logic goes to the data and not vice versa. All processing is done on the datanode itself, and space is required to hold the intermediate result sets. This could vary with the specifics of the case study; using something like Spark or Ignite would mean that more memory is required than disk.
    • Total Available for HDFS considering above (75% of 72 TB) = 54 TB.
    • Assuming that 10% Data in HDFS could be derived, inferred or aggregated from raw data and saved in HBase or other formats, actual space for Core HDFS (90% of 54 TB) = 48 TB.
    • Taking Replication factor as 3, Actual Data Input (48 TB / 3) = 16 TB.
    • Just in case an algorithm like Snappy is used, which provides up to 40% compression, data input could be accommodated up to an upper limit of (16 TB * 140%) = 22 TB. Processing time would scale up accordingly, though, and must be ascertained beforehand through sample benchmarking.
    • Midway between raw and compressed makes the starting point 19 TB.
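The sizing steps above can be written out as a short Python sketch. All the percentages (2% non-HDFS, 23% processing, 10% derived data, 40% Snappy gain) are the illustrative figures from this walkthrough, not universal constants.

```python
# The capacity walkthrough above as a sketch, using the figures from the text.

nodes, disk_tb_per_node = 3, 24
raw_tb = nodes * disk_tb_per_node       # 72 TB of raw disk across the cluster
hdfs_tb = raw_tb * 0.75                 # 54 TB after 2% non-HDFS + 23% processing space
core_tb = hdfs_tb * 0.90                # ~48 TB after 10% derived/aggregated data
input_tb = core_tb / 3                  # ~16 TB of actual input at replication factor 3
with_snappy_tb = input_tb * 1.40        # ~22 TB upper bound with Snappy compression
start_point_tb = (input_tb + with_snappy_tb) / 2  # ~19 TB midway starting point

print(raw_tb, int(hdfs_tb), int(core_tb), int(input_tb),
      int(with_snappy_tb), int(start_point_tb))
```

Changing any single assumption (replication factor, processing reserve, compression ratio) flows straight through the chain, which is exactly why the text insists on sample benchmarking before committing to a figure.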
More than the data itself, the critical part is the data pattern. It has to be consistently high and grow at a healthy rate, preferably a compounding one.

As for our example, we are currently at 3.5 GB/$1, which doesn't sound very attractive. However, the majority of those expenses are capital in nature, and in the long term, if you think in terms of the business growth we would achieve by managing 100 GB per day, each node addition would start bringing the costs down to a profitable proposition. Of course, this complete exercise has just been a generic walkthrough, and there are still several things that can be done to ensure optimum efficiency. For example, we have only taken annual contracts into account, whereas there are further discounts if we go for, say, 3 years, and then there are bulk purchase discounts as well...
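As a rough sketch of what the quoted ratio implies, here is the annual cost of ingesting 100 GB per day at 3.5 GB per dollar (both figures from the text). This deliberately ignores compounding growth, multi-year discounts, and bulk pricing.

```python
# Rough annual-cost implication of the 3.5 GB/$ ratio quoted above.
# Both inputs come from the text; growth and discounts are ignored.

def annual_cost_usd(gb_per_day: float, gb_per_dollar: float) -> float:
    """Annual storage cost implied by a daily ingest rate and a GB-per-dollar ratio."""
    return gb_per_day * 365 / gb_per_dollar

print(round(annual_cost_usd(100, 3.5)))  # ~$10,429/year at the quoted ratio
```

Each node addition lowers the effective cost per GB, which is the profitability argument the paragraph above is making.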

All in all, it should be quite clear by now that a BigData implementation is more about proper configuration and a compact relationship between data and storage.

With this, we come to the end of Part I.

Next, we will see how to set up a test environment for simulating HADAMS, so that we can gain hands-on knowledge of the scheme of things.
