Metro Clustering on VMware

References
vSphere Metro Storage Cluster white paper released!
Stretched Clusters and VMware vCenter Site Recovery Manager
VMware KB: vSphere 5.x support with NetApp MetroCluster

I have been getting a lot of questions about metro clustering on VMware and about active-active data center setups.  I would like to lay this out to help anyone who is considering a metro cluster understand the main requirements, the cost involved, and the scenarios that need to be understood before making that decision.

Many want a stretched cluster mainly to achieve an active-active data center setup.  Some want an easier disaster recovery.

I must clarify a misconception here.  Metro clustering is Downtime Avoidance (DA).  It is not Disaster Recovery (DR).  DR involves downtime to recover.  Metro clustering is an active-active setup that gives near-zero downtime, provided the storage is synchronously replicated.

Some realistic facts:
  • Metro clustering can break.  When it breaks, do you have a backup plan?
  • If one site fails, can you make sure the whole environment is up and running within your RTO?
  • Shifting workload from primary to secondary and then bringing down the primary site is not a DR exercise.  A real simulation of a site failure needs to be executed to learn the scenario and the actions needed to bring up the environment or resolve any issues.  Shifting workload from site to site does not simulate that scenario.

Ask yourself again: do you need a near-zero downtime setup?  Can another solution, e.g. RAC, VCS, etc., provide the same?  Before embarking on it, be aware that storage replication depends heavily on the network.  Metro clustering cannot be achieved via a fiber connection, unlike your typical within-site replication.

If, after the above, you still decide to go ahead with metro clustering, here are the test cases you should be well versed in.  I believe you should have three areas of expertise in your company, i.e. Network, Storage and VMware, so that if any issue arises you can identify and resolve it immediately.

Requirements:
  • A metro cluster storage solution, e.g. EMC VPLEX, NetApp MetroCluster, IBM Storwize, etc.
  • A stretched layer 2 network, e.g. using Cisco OTV.
  • Three sites for the ideal scenario, or two sites with a more complex recovery plan.
  • A big network pipe to replicate all your data over.
  • Network link latency between the two sites must be less than 5 ms.
  • The inter-site connection cannot exceed 100 km.
  • Lots of money :)
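The two hard numbers in the list above (latency and distance) are easy to turn into a quick pre-check. Below is a minimal sketch in Python. The class name `InterSiteLink`, the function `check_link` and the field names are hypothetical and only for illustration. The list does not say whether the 5 ms is one-way or round-trip, so confirm that with your storage vendor before trusting the threshold.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the requirements list above.
MAX_LATENCY_MS = 5.0     # latency must be less than this
MAX_DISTANCE_KM = 100.0  # distance must not exceed this


@dataclass
class InterSiteLink:
    """Hypothetical description of the link between the two sites."""
    latency_ms: float
    distance_km: float
    stretched_layer2: bool  # e.g. provided by Cisco OTV


def check_link(link: InterSiteLink) -> List[str]:
    """Return the violated requirements; an empty list means the link passes."""
    problems = []
    if link.latency_ms >= MAX_LATENCY_MS:
        problems.append(
            f"latency {link.latency_ms} ms is not below {MAX_LATENCY_MS} ms"
        )
    if link.distance_km > MAX_DISTANCE_KM:
        problems.append(
            f"distance {link.distance_km} km exceeds {MAX_DISTANCE_KM} km"
        )
    if not link.stretched_layer2:
        problems.append("no stretched layer 2 network between the sites")
    return problems


if __name__ == "__main__":
    for issue in check_link(InterSiteLink(7.0, 120.0, False)):
        print("FAIL:", issue)
```

Passing this check only means the link is not ruled out on paper. It says nothing about bandwidth, which still has to be sized for your replication traffic.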

Below are just some scenarios to take note of.  Do note this is not a complete list of use cases, as there can be more depending on the type of metro clustering storage used, and the impact on vSphere differs between 4.x and 5.x.

Typically a complete setup needs at least 3 sites: 2 for the active-active setup and one as a "witness" that decides which site is the Primary when a network partition happens.  This 3rd site would also be the Disaster Recovery site to complete your DR plan, which can be manual or automated using SRM.
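To picture what the witness does during a partition, here is a simplified sketch in Python. It is a toy model under my own assumptions, not any storage vendor's actual algorithm: the function name `surviving_site`, the site labels "A" and "B" and the `preferred` parameter are all hypothetical.

```python
from typing import Optional


def surviving_site(
    sites_see_each_other: bool,
    witness_sees_a: bool,
    witness_sees_b: bool,
    preferred: str = "A",
) -> Optional[str]:
    """Decide which site keeps serving storage IO.

    Returns "both" when there is no partition, "A" or "B" when the
    witness can pick a winner, and None when it cannot (manual recovery).
    """
    if sites_see_each_other:
        # No partition: both sites stay active-active.
        return "both"
    if witness_sees_a and witness_sees_b:
        # Pure inter-site link failure: break the tie with the preferred site.
        return preferred
    if witness_sees_a:
        return "A"
    if witness_sees_b:
        return "B"
    # The witness is isolated too, so nobody can arbitrate.
    return None


if __name__ == "__main__":
    print(surviving_site(False, True, True))
    print(surviving_site(False, False, True))
    print(surviving_site(False, False, False))
```

The last case is the one to worry about. Without a third site there is nothing to arbitrate, which is why a two-site setup needs a more complex, manual recovery plan.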

One important thing to note: for any vApp that consists of multiple tiers, restart priority is not taken into consideration.

If this setup breaks, the worst-case scenario is not listed below, but the recovery can take as long as or longer than a tape recovery, due to a complexity I am not able to imagine.

 
| Scenario | Description | VMware |
| --- | --- | --- |
| Storage backend path failure (Any) | All redundancy best practices for storage paths should be practiced. | No impact. |
| Storage frontend path failure (Any) | All redundancy best practices for storage paths should be practiced. | No impact. |
| Storage array failure (Primary) | Metro cluster storage stays online, supported by the Secondary site array. | No impact. |
| Storage array failure (Secondary) | Metro cluster storage stays online, supported by the Primary site array. | No impact. |
| Front end switch failure (Primary) | All redundancy best practices for storage network switches should be practiced. | No impact. |
| Front end switch failure (Secondary) | All redundancy best practices for storage network switches should be practiced. | No impact. |
| Storage failure (Primary) | Metro cluster storage provides access to storage from the Secondary site. | No impact. |
| Storage failure (Secondary) | Metro cluster storage provides access to storage from the Primary site. | No impact. |
| Site failure (Primary) | Storage access will be from the Secondary site. | HA reacts to the loss of the Primary site ESX servers and powers the VMs up on the Secondary site, provided Admission Control is not violated. |
| Site failure (Secondary) | Storage access will be from the Primary site. | HA reacts to the loss of the Secondary site ESX servers and powers the VMs up on the Primary site, provided Admission Control is not violated. |
| ESX failure (Any) | No impact. | Normal HA takes place.  vSphere 4.x: not more than 4 host failures.  vSphere 5: no dependency. |
| ESX management network failure (Any) | No impact. | HA takes place where ESX fails. |
| All Path Down (APD) | No impact. | ESX 4.x: may require a reboot of ESX.  Refer to KB1, KB2.  ESX 5.x: depends.  Refer to KB. |
| Metro cluster storage network failure | Network partition.  Check with your storage vendor whether there is a preferred-site module that can be placed on a 3rd site.  If no such feature is used, learn the best practice for restoring storage and manually shutting down one of the sites. | Depends on setup. |
| Both sites failure | Identify the correct procedures to power up and restore storage on both sites. | Take note of VM start-up priority.  HA priority does not guarantee which VM powers up first, even when a priority is set. |
| vSphere inter-site management network failure | No impact. | Network partition on the vSphere layer.  HA will restart the VMs that are on one site but not those on the other site with the failure, since their data is on one of the sites. |
| Metro cluster storage failure & vSphere inter-site management network failure | Network partition on metro cluster storage.  Follow the storage procedures to restore or suspend IO on a specific site. | Split brain on vSphere.  Lots of work.  Understand whether you are on vSphere 4.x or 5.x and how each version reacts. |
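The site failure scenarios above only end well if the surviving site has room for everything. Here is a rough sketch in Python of that capacity question. It is a simplification under my own assumptions: real HA Admission Control works with slot sizes or reserved percentages, and the function name `can_absorb_site_failure`, the site names and the memory figures are hypothetical.

```python
from typing import Dict


def can_absorb_site_failure(
    capacity_gb: Dict[str, float],
    demand_gb: Dict[str, float],
    failed_site: str,
) -> bool:
    """Return True if the remaining sites can host all VMs after one site is lost.

    capacity_gb: usable host memory per site.
    demand_gb:   memory needed by the VMs currently running per site.
    """
    total_demand = sum(demand_gb.values())
    surviving_capacity = sum(
        cap for site, cap in capacity_gb.items() if site != failed_site
    )
    return total_demand <= surviving_capacity


if __name__ == "__main__":
    capacity = {"primary": 512.0, "secondary": 512.0}
    demand = {"primary": 300.0, "secondary": 300.0}
    for site in capacity:
        ok = can_absorb_site_failure(capacity, demand, site)
        print(f"lose {site}: {'fits' if ok else 'does NOT fit'}")
```

In a symmetric two-site design this boils down to running each site at no more than about half of its capacity, which is part of why the cost adds up.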


I must place a disclaimer here: I am neither a network nor a storage expert.  There will be more to this due to storage or network failures.  Speak to your network and storage specialists about this.

Know what you are getting into before making your choice.  There is only so much your engineers can do.

Simplicity or complexity.  DA or DR?  What do you think?


Update 16th Apr 2013
Added reference: vSphere Metro Storage Cluster white paper released!