I recently ran into a problem with SRM version 126.96.36.199, and I would like to tell you more about it.
I was involved in building a VMware environment with SRM as the disaster recovery solution. Because this was a first for this organization, we did a proof of concept (POC) before building the production environment.
The POC environment was built on HP Gen8 blade servers
The SAN we used was IBM SAN Volume Controller (SVC)
vSphere 5.5 Update 1 was used (ESXi and vCenter)
SRM 5.5.0 or 188.8.131.52 (not sure anymore)
After building the POC environment it was time to test SRM. After some errors related to unmounting datastores (maybe I’ll write a blog about that later) the recovery went fine. We executed the recovery plan many times (50+) without receiving any errors. Once all functional tests had been done, it was time to recreate the environment, so we deleted everything and started building from scratch.
I rebuilt the entire environment using the latest available (minor) versions.
vSphere 5.5 Update 2 was used (ESXi and vCenter)
Although SRM 5.8 had come out, we did not use it: SRM 5.8 uses the web client instead of the .NET client, and both our POC and all our documentation were based on SRM 5.5.
After the rebuild it was time to test SRM again. The first couple of recoveries went fine, but after that I received errors. The strange thing was that these errors were not consistent, meaning they did not show up on every recovery. The error causes some virtual machines to lose their network mapping and prevents them from starting.
Error – Unable to copy the configuration file ‘[DatastoreName] SRM-03/SRM-03.vmx’ from the host to ‘C:\Users\serviceaccount\AppData\Local\Temp\vmware-serviceaccount\SRM-03.vmx72-101’ Failed to copy file ‘[DatastoreName] SRM-03/SRM-03.vmx’ to ‘C:\Users\serviceaccount\AppData\Local\Temp\vmware-serviceaccount\SRM-03.vmx72-101’: 11 – The session does not have the required permissions.
After searching the web I found a knowledge base article from VMware and thought: YES, this is a known error and the solution is nearby.
Sadly, the solution provided by VMware did not resolve the problem. I opened a case with VMware, and at first they focused on increasing the timeout settings mentioned in the knowledge base article. They thought it had something to do with the SAN not being ready after the LUNs were promoted. This resulted in an increased recovery time. In our POC environment a recovery took 15 minutes; now, with all the timeouts, it took 35 minutes. This was not acceptable, and luckily it was also not the solution.
After the case was escalated and transferred to the engineering team, they discovered a timing problem between ESXi, vCenter and the storage array. I don’t mean a timing error like clocks that are out of sync or in a different time zone; I mean a timing issue in the SRM runbook.
VMware believed these errors were solved in SRM 184.108.40.206, but then realized the fixes had not been implemented in that version, only in SRM 5.8.
So after this the solution was simple: upgrade to SRM 5.8. I did just that, and after testing the recovery (50+ times) I’m glad this issue is resolved.
Test recovery plans many times (not 10 times, but really 30 or 40 times) to be sure they are working.
After upgrading components, test SRM again.
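To illustrate why a handful of test runs is not enough for an intermittent error like this one, here is a minimal Python sketch. It assumes every recovery run fails independently with the same probability; the 5% failure rate is a made-up number for illustration, not something measured in this environment, and the function names are my own.

```python
import math


def chance_of_seeing_failure(failure_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must not be negative")
    # P(at least one failure) = 1 - P(every run succeeds)
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of runs that exposes the failure with the given confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= confidence for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    hypothetical_rate = 0.05  # assumed value, for illustration only
    for runs in (10, 30, 40):
        chance = chance_of_seeing_failure(hypothetical_rate, runs)
        print(f"{runs} runs: {chance:.0%} chance of seeing the error at least once")
    print("runs needed for 95% confidence:", runs_needed(hypothetical_rate, 0.95))
```

With a failure rate that low, 10 clean runs say very little, while 30 to 40 runs make it much more likely that the error shows up at least once.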
Thanks to VMware support for helping out!