An overview of troubleshooting techniques for diagnosing issues with a Smoothwall Failover, or High Availability (HA), installation.
The Smoothwall Failover feature makes use of two Smoothwalls connected via a dedicated heartbeat interface where system health and availability are monitored.
One unit is always referred to as the “main” and the other as the “failover”. At any time, one unit runs in active mode and the other in passive mode:
- Active mode: All interfaces are active. The Smoothwall passes traffic and manages internet connections.
- Passive mode: Only the heartbeat interface is active. It is used to monitor the status of the active system.
When a failover event occurs, the main’s connection to the failover is severed. The failover unit takes over by activating all configured network interfaces and issuing a broadcast on every interface, so that other systems on the same physical Ethernet segment remove the MAC address associated with the main’s IP address. All services are then made active on the failover unit, switching it from passive mode to active mode. The entire switchover should take less than one minute.
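One way to observe the takeover from a Linux client on the same segment is to compare the MAC address the client has cached for the Smoothwall's IP address before and after the failover. The following is a minimal sketch, not a Smoothwall tool: `mac_for_ip` is a hypothetical helper, and `192.168.1.1` is a placeholder for the address clients use as their gateway.

```sh
#!/bin/sh
# Hypothetical helper: print the MAC address recorded for one IP address
# in `ip neigh show` output. The neighbour table is read from stdin, so
# the function works on live output or on a saved copy.
mac_for_ip() {
    awk -v ip="$1" '
        $1 == ip {
            # The MAC address follows the "lladdr" keyword, when present.
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") { print tolower($(i + 1)); exit }
        }'
}

# Example on a client (192.168.1.1 is a placeholder address):
#   before=$(ip neigh show | mac_for_ip 192.168.1.1)
#   ...trigger the failover, wait, then send traffic through the gateway...
#   after=$(ip neigh show | mac_for_ip 192.168.1.1)
#   [ "$before" != "$after" ] && echo "MAC changed: $before -> $after"
```

If the cached MAC address does not change after the switchover, the client or an intermediate switch may not have processed the broadcast.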
See our help topic, Adding a connection to the heartbeat interface.
Things to Consider when Installing
- Configure the main first.
- Use the same software serial number for the failover unit as was used for the main.
- On the failover unit, use a temporary IP address to add the gateway and DNS settings before installing the failover archive.
- If configuring a failover setup with an existing live main unit, ensure the failover unit has the same update and release level as the main unit before installing the failover archive.
- If installing a failover unit at the same time as the main unit, you can install the archive at installation time.
- If using a USB stick to transfer the failover archive from main to failover, the USB stick must be formatted as FAT32.
- Before importing the settings from the failover archive, disconnect all network cards except for the heartbeat interface; this is the only network card that should be connected at import time. When the failover archive is imported, the network settings take effect and a reboot is required, so disconnecting the other cards prevents duplicate IP addresses on the network.
Finalizing the install
While the failover system is rebooting, log into the administration UI on the main, go to Reports > Realtime > System, and select Heartbeat from the Section drop-down list. When the failover system is coming online, messages should appear in the heartbeat log showing that the main can communicate with the failover unit. Various messages will appear showing that settings have been transferred and that both main and failover have taken control of the resources used by the HA.
Once the failover unit is up, try to SSH to the failover unit from the main:
# ssh -p 222 10.99.0.2
This should complete normally and allow root login on the failover unit. Once logged in, issue the command:
# ip addr
and check that only the heartbeat interface is listed as being active. That shows that the failover system is communicating with the main and has gone into passive mode as the main is alive and well.
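The same check can be scripted from the main. The sketch below is illustrative only: `up_interfaces_except` is a hypothetical helper, `eth3` is a placeholder for whichever interface is configured as the heartbeat, and it assumes that "active" corresponds to the `UP` flag in `ip -o link show` output.

```sh
#!/bin/sh
# Hypothetical helper: read `ip -o link show` output on stdin and print
# every interface that carries the UP flag, ignoring loopback and the
# heartbeat interface named in $1. No output suggests passive mode.
up_interfaces_except() {
    awk -v hb="$1" -F': ' '
        {
            name = $2
            sub(/@.*/, "", name)            # strip "@parent" on VLAN links
            if (match($3, /<[^>]*>/)) {
                # Wrap the flag list in commas so "UP" is matched exactly
                # and "LOWER_UP" is not mistaken for it.
                flags = "," substr($3, RSTART + 1, RLENGTH - 2) ","
                if (index(flags, ",UP,") && name != "lo" && name != hb)
                    print name
            }
        }'
}

# Example from the main (eth3 is a placeholder heartbeat interface name):
#   ssh -p 222 10.99.0.2 ip -o link show | up_interfaces_except eth3
```

Any interface name printed is up on the failover unit in addition to the heartbeat interface and is worth checking against the `ip addr` output.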
Once this is confirmed, cable the failover as the main has been cabled. A failover test can now be done.
Testing the Failover Unit
With both the main and failover units cabled and the failover unit confirmed in passive mode, a failover test can and should be done.
On the main unit, go to System > Hardware > Failover and click Enter standby. You will almost immediately lose connection to the administration UI as the main goes into passive mode.
Wait 10 seconds and try to access the administration UI again; you should now see the failover system’s administration UI. The failover system always displays a warning message stating that it is the failover system, along with a timestamp of the latest transfer of settings from the main.
Check that the failover unit is now passing traffic and all services are working as expected.
Testing Failback
Once the failover has been tested and all services are running on the failover unit, it’s time to fail back. First, try to navigate to the main unit’s administration UI using port 440: https://<smoothwall.domain.name>:440
Port 440 redirects over the heartbeat interface to the passive Smoothwall, which should currently be the main.
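The port 440 check can also be made from a command line with curl. This is a rough sketch under stated assumptions: `admin_url` and `wait_for_admin_ui` are hypothetical helpers, `smoothwall.example.com` stands in for your own host name, and `-k` is used on the assumption that the administration UI presents a certificate the client does not trust.

```sh
#!/bin/sh
# Hypothetical helper: build the administration UI URL for a host.
# The port defaults to 440, the port that reaches the passive unit.
admin_url() {
    printf 'https://%s:%s/\n' "$1" "${2:-440}"
}

# Hypothetical helper: poll the URL until the server answers.
# $1 = host, $2 = port (default 440), $3 = number of attempts (default 6)
wait_for_admin_ui() {
    url=$(admin_url "$1" "${2:-440}")
    tries=${3:-6}
    while [ "$tries" -gt 0 ]; do
        # -k skips certificate verification, -s silences progress output,
        # -o /dev/null discards the page, --max-time bounds each attempt.
        if curl -k -s -o /dev/null --max-time 5 "$url"; then
            echo "reachable: $url"
            return 0
        fi
        tries=$((tries - 1))
        sleep 5
    done
    echo "not reachable: $url" >&2
    return 1
}

# Example (placeholder host name):
#   wait_for_admin_ui smoothwall.example.com 440
```

curl reports success as soon as the server answers at all, so this only confirms that something is listening on port 440; confirm in a browser that the page shown belongs to the main.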
Once access to the main’s administration UI over port 440 has been confirmed, navigate back to the failover system, go to System > Hardware > Failover, and click Enter standby.
Wait 10 seconds and refresh the page; the main’s administration UI should now be shown. Test that all services are running.
Note: The active and passive roles can swap between the two units. When the main is active, the failover is passive, and vice versa. The passive unit has only one enabled interface, the heartbeat interface. The active unit has all configured interfaces enabled.
Issues: Split brain syndrome
Split brain syndrome occurs when both systems are in active mode. The symptoms are:
- Intermittent internet
- Accessing the administration UI alternates between main and failover unit on the same port number
If split brain syndrome occurs, disconnect all network cards on the failover unit apart from the heartbeat interface. The heartbeat connection is needed to access the failover unit, and all other network cards must be disconnected to avoid duplicate IP addresses on the network.
Once the failover has been disconnected, access the failover unit’s administration UI on port 440 and reboot the failover unit. Monitor the heartbeat logs on the main when the failover system comes back up and check that the log messages look OK.
Once the failover unit is back up, log in to the command line via SSH from the main and issue the command:
# ip addr
Again, check if the failover system is in active or passive mode. Only the heartbeat interface should be up if the failover is in passive mode. If the failover is still in active mode after a reboot, try issuing the following command on the command line:
# smoothcom enterstandby
and check the log messages on the main to see if the failover and main unit are communicating correctly.
If the failover unit is still in active mode after these two attempts, contact our Support department.