Some time ago I designed and built an infrastructure for a client to support the services they deliver to their customers. The environment is spread across two datacenters. Each site runs a vCenter Server and a couple of ESXi hosts to support the workloads. Both sites run part of the workload and hold spare capacity in case one site fails.
Veeam Backup & Replication is used to create backups and to replicate the VMs to the other site every hour. During the initial configuration and testing of the backup and replication functionality we did not run into any problems and were happy with the results.
The reason behind this blog is that when the workload started to increase we ran into issues and eventually needed help from Veeam support (twice) to solve them. In this blog I will discuss our setup, the problems we faced and of course the solution.
As you can see, the environment consists of two vCenter Servers joined together in Linked Mode. Each site holds an ESXi cluster. vSphere 5.5 Update 3 is used.
As mentioned earlier, we use Veeam replication (v9.0 Update 1) to replicate the VMs every hour. This includes the Veeam backup server itself. Veeam Backup Server 1 is our primary backup server; all backup and replication jobs are configured on this server.
Veeam Backup Server 2 is the standby server. Veeam Backup Server 1 stores its configuration backups on this server. It is only needed when the replica of Veeam Backup Server 1 is unable to start.
Backup Proxy 1, 2 and 3 are used for the replication process. Veeam Backup Server 1 also holds the Backup Proxy Services components.
Veeam has three transport modes it can use to access data:
- Direct storage access
- Virtual appliance
- Network
Direct storage access – can only be used by the backup proxy servers on the source site, that is, the site where the source VM is running. This is because the storage is directly attached to each ESXi server there.
Virtual appliance mode – can be used because this mode relies on VMware's HotAdd feature: the backup proxy server attaches the disks of the source VM to itself and retrieves the data directly from the datastore.
Network – the least efficient way of transferring data, and not recommended by Veeam.
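The fallback order described above can be sketched roughly as follows. This is a simplified illustration only, not Veeam's actual selection algorithm; the function name, parameters and mode strings are my own invention.

```python
# Simplified sketch of how a backup proxy's transport mode could be chosen.
# NOT Veeam's actual algorithm; all names here are hypothetical.

def pick_transport_mode(has_direct_storage_access: bool,
                        is_vm_on_source_site: bool) -> str:
    """Return the most efficient transport mode available to a proxy."""
    if has_direct_storage_access:
        # Direct storage access: the proxy reads blocks straight from the
        # directly attached storage on the source site.
        return "direct-storage-access"
    if is_vm_on_source_site:
        # Virtual appliance (HotAdd): the proxy VM temporarily attaches the
        # source VM's disks to itself and reads from the datastore.
        return "virtual-appliance"
    # Fallback: data travels over the network (least efficient).
    return "network"

print(pick_transport_mode(False, True))  # -> virtual-appliance
```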
The Problem we faced
When we initially set up this environment and tested the replication and, of course, the disaster recovery options, everything worked as expected. Later on, when the load started to increase, the replication jobs began to fail with: Error: Detected an invalid snapshot configuration.
Once this occurs, every subsequent replication job shows the same error for that particular VM. The only way to get replication going again was to remove the replica from disk using the Veeam console. After we did this, replication worked properly again. Over time the issue recurred, the number of VMs impacted started to increase and the time between errors started to decrease. At this point we first contacted Veeam support.
Veeam support analyzed the log files and could not find anything wrong. I started to suspect the ESXi servers, fully patched them (new patches had been released since the initial installation) and rebooted them, but that did not solve anything. Update 2 of Veeam had just been released and Veeam support suggested installing it. After installing this update and rebooting the server, replication worked again with no issues. Well, at first without issues: after a couple of days the errors returned, though not as often as before.
After we contacted Veeam support again, they analyzed the logs and asked if we could use the network transport mode. We agreed, but only temporarily, because this mode is the least efficient way of transferring data and Veeam does not recommend it. While using the network transport mode for several days we did not encounter any errors, but the replication process took longer. Based on these results, Veeam support recommended creating an additional backup proxy server in site 1 and disabling the Backup Proxy Services components on Veeam Backup Server 1, so that Veeam Backup Server 1 would no longer be used as a backup proxy.
Veeam support provided the following information and explanation. They found these error messages in the log files:
[21.12.2016 21:32:36] <73> Error Processing VMNAME
[21.12.2016 21:32:36] <73> Error Failed to delete replica vm '[name 'VMNAME_replica', ref 'vm-89382']' working snapshot 'snapshot-89849' (System.Exception)
According to Veeam support, this is the moment the replica VM broke. There are three common reasons why snapshot deletion may fail:
- The datastore running low on space due to open snapshots, which makes it impossible to consolidate the deleted snapshot
- Windows automount being enabled on hot-add proxies
- The Veeam server and hot-add proxies being backed up in parallel with jobs in which they are used
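The three causes above boil down to conditions you can check before a job runs. The following sketch just formalizes that checklist; it is my own illustration, not a Veeam tool, and all names and thresholds are hypothetical.

```python
# Hypothetical pre-flight checklist mirroring the three common causes of
# snapshot-deletion failures listed above. Not a Veeam feature.

def snapshot_delete_risks(free_space_gb: float,
                          expected_snapshot_growth_gb: float,
                          automount_enabled: bool,
                          proxy_backed_up_while_in_use: bool) -> list:
    """Return a list of risk descriptions that apply to this job."""
    risks = []
    if free_space_gb < expected_snapshot_growth_gb:
        # Cause 1: no room left to consolidate the deleted snapshot.
        risks.append("datastore low on space for consolidation")
    if automount_enabled:
        # Cause 2: Windows automount active on a hot-add proxy.
        risks.append("Windows automount enabled on hot-add proxy")
    if proxy_backed_up_while_in_use:
        # Cause 3: the proxy/Veeam server is in a job that runs while
        # the machine is serving hot-add sessions.
        risks.append("proxy backed up while serving hot-add jobs")
    return risks
```

In our case only the third check would have fired, which matches Veeam support's conclusion.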
The first two reasons did not apply; the third did, because we replicate Veeam Backup Server 1 while it is also being used as a proxy.
According to Veeam support, backing up the Veeam server and hot-add proxies while they are being used as hot-add proxies is not a good idea. It may work, but it may also cause various issues at the moment a snapshot is removed from them (during snapshot removal a VM may stop responding for a few moments: https://kb.vmware.com/kb/1002836).
The Veeam server is a management server which handles all requests (including disk mount/unmount for proxies and snapshot removal), so it is better not to back it up or replicate it at all. If backing up or replicating your backup infrastructure servers is necessary, you should isolate them in a separate job and schedule it so it does not overlap with other jobs.
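Checking that the isolated infrastructure job really does not overlap the hourly workload jobs is a small interval problem. Here is a minimal sketch, assuming every job repeats on the same period (60 minutes in our case) and is described by its start offset and duration in minutes; the helper name is my own.

```python
# Hypothetical overlap check for periodic jobs. Each job is given as
# (start offset in minutes past the hour, duration in minutes).

def overlaps(start_a: int, dur_a: int, start_b: int, dur_b: int,
             period: int = 60) -> bool:
    """True if two jobs repeating every `period` minutes ever overlap."""
    # Shift job A by -period/0/+period to catch wrap-around past the hour.
    for shift in (-period, 0, period):
        a_start = start_a + shift
        a_end = a_start + dur_a
        if a_start < start_b + dur_b and start_b < a_end:
            return True
    return False

# Infrastructure job at :00 for 10 min vs. workload jobs at :15 and :45:
print(overlaps(0, 10, 15, 20))  # -> False
print(overlaps(0, 10, 45, 20))  # -> True (the :45 job wraps past :00)
```

With a check like this you can place the job that replicates the Veeam server and proxies in a window no other job touches.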
In the end we fully patched our ESXi hosts, installed Veeam Backup & Replication 9.0 Update 2, created an additional backup proxy and disabled the Backup Proxy Services on Veeam Backup Server 1.
This appears to be the final solution, as we have not seen any replication errors for weeks now.