[Bacula-users] Backup finished, but "Fatal error: Network error with FD"/"Connection reset by peer"
Raimund Sacherer
2015-08-03 10:52:43 UTC
Hello,

We have been using Bacula for about 6 years and it works great. Some 6 months ago we switched to another Bacula server. The switch included changing the OS from Linux to FreeBSD (to get better flexibility with ZFS, etc.).

Since the move we have been experiencing some problems. I keep logs for about 3 months, so right now I can say that out of about 3100 backup jobs, 11 failed with:

02-Aug 09:13 backupserver-dir JobId 110334: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
02-Aug 09:13 backupserver-dir JobId 110334: Error: Bacula backupserver-dir 5.2.12 (12Sep12):

But it seems the backup has finished just fine. After changing the volume status from Error to Used we can restore files.

It seems that after the backup is finished, a communication attempt between the director and the client fails somehow.

All those clients are in the same LAN. The backup times are comparable, as are the numbers of files, etc.

It *seems* to only affect Windows, but I cannot verify this, as I do not have logs going back more than 3 months.

I read somewhere that there could be problems with some sort of timeout in the FreeBSD network stack, but before twiddling with some knobs I would really appreciate hearing from anyone who has had similar problems in the past and knows what the root cause is.


Here is an example of one failed (02-Aug) and two successful backups of the same job over the last 3 weeks:


ERROR:
02-Aug 09:13 backupserver-dir JobId 110334: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
02-Aug 09:13 backupserver-dir JobId 110334: Error: Bacula backupserver-dir 5.2.12 (12Sep12):
Scheduled time: 01-Aug-2015 15:03:01
Start time: 01-Aug-2015 22:08:09
End time: 02-Aug-2015 09:13:31
Elapsed time: 11 hours 5 mins 22 secs
Priority: 12
FD Files Written: 547,868
SD Files Written: 547,868
FD Bytes Written: 1,202,478,052,857 (1.202 TB)
SD Bytes Written: 1,202,608,809,023 (1.202 TB)
Rate: 30120.7 KB/s


OK:
Scheduled time: 25-Jul-2015 15:03:00
Start time: 25-Jul-2015 21:52:17
End time: 26-Jul-2015 08:24:29
Elapsed time: 10 hours 32 mins 12 secs
Priority: 12
FD Files Written: 545,740
SD Files Written: 545,740
FD Bytes Written: 1,193,323,311,780 (1.193 TB)
SD Bytes Written: 1,193,453,558,522 (1.193 TB)
Rate: 31459.5 KB/s


OK:
Scheduled time: 18-Jul-2015 15:03:00
Start time: 18-Jul-2015 20:26:21
End time: 19-Jul-2015 05:52:25
Elapsed time: 9 hours 26 mins 4 secs
Priority: 12
FD Files Written: 543,122
SD Files Written: 543,122
FD Bytes Written: 1,176,812,345,702 (1.176 TB)
SD Bytes Written: 1,176,941,989,504 (1.176 TB)
Rate: 34648.8 KB/s
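
A quick arithmetic check on the three summaries above, as a minimal sketch. It assumes the reported Rate is "FD Bytes Written" divided by the elapsed time, with 1 KB = 1000 bytes; that is inferred from the numbers themselves, not taken from Bacula's documentation. Under that assumption the failed job's rate is reproduced in exactly the same way as the two successful ones, which is consistent with the data transfer having completed before the error was raised.

```python
# Figures copied from the three job summaries in the post.
# Each entry: (label, elapsed (h, m, s), FD bytes written, reported rate in KB/s)
JOBS = [
    ("ended 02-Aug (failed)", (11, 5, 22), 1_202_478_052_857, 30120.7),
    ("ended 26-Jul (OK)", (10, 32, 12), 1_193_323_311_780, 31459.5),
    ("ended 19-Jul (OK)", (9, 26, 4), 1_176_812_345_702, 34648.8),
]


def rate_kb_per_s(elapsed, fd_bytes):
    """Average throughput, assuming 1 KB = 1000 bytes (an assumption)."""
    hours, minutes, seconds = elapsed
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return fd_bytes / total_seconds / 1000


if __name__ == "__main__":
    for label, elapsed, fd_bytes, reported in JOBS:
        computed = rate_kb_per_s(elapsed, fd_bytes)
        print(f"{label}: computed {computed:.1f} KB/s, reported {reported} KB/s")
```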



Thank you,
Best regards
Ray


------------------------------------------------------------------------------
Josh Fisher
2015-08-05 13:34:12 UTC
Post by Raimund Sacherer
Hello,
We have been using Bacula for about 6 years and it works great. Some 6 months ago we switched to another Bacula server. The switch included changing the OS from Linux to FreeBSD (to get better flexibility with ZFS, etc.).
02-Aug 09:13 backupserver-dir JobId 110334: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
But it seems the backup has finished just fine. After changing the volume status from Error to Used we can restore files.
It seems that after the backup is finished, a communication attempt between the director and the client fails somehow.
All those clients are in the same LAN. The backup times are comparable, as are the numbers of files, etc.
It *seems* to only affect Windows, but I cannot verify this, as I do not have logs going back more than 3 months.
I read somewhere that there could be problems with some sort of timeout in the FreeBSD network stack, but before twiddling with some knobs I would really appreciate hearing from anyone who has had similar problems in the past and knows what the root cause is.
I have seen this before as well, although not with FreeBSD. Bacula-dir
expects the TCP connection with the client to remain up throughout the
entire job. In my case I concluded that it was aggressive Windows power
management shutting down the Ethernet interface PHY. I continue to have
problems with Mac OS X clients' power management shutting down the
wireless PHY, but have not had time to investigate. With Windows 7 it is
possible to disable the "Allow the computer to turn off this device to
save power" setting in the Power Management tab of the network adapter's
Properties. It depends on the NIC driver as to whether or not this is
needed. Some drivers report that they handle various sleep states when
they in fact do not, or at least they do not return to D0 state in a
timely manner.

Another possibility for Windows 7 is Energy Efficient Ethernet. That
can be disabled too if for example the NIC supports EEE but a switch in
between (or the NIC driver in FreeBSD) does not, or if somewhere in
between client and Dir the EEE implementations do not agree on the same
"standard".

And finally, many switches also have TCP timeout settings and/or EEE and
power management that could potentially not work correctly with either
the FreeBSD or the Windows network stacks.

In any case, it is almost certainly a network issue, rather than a
general Bacula issue. Because Bacula leaves TCP connections open for
extended periods it is really good at discovering network issues.


------------------------------------------------------------------------------
Raimund Sacherer
2015-08-06 09:09:30 UTC
Hello Josh, Bacula-users,
Post by Josh Fisher
I have seen this before as well, although not with FreeBSD. Bacula-dir
expects the TCP connection with the client to remain up throughout the
entire job. In my case I concluded that it was aggressive Windows power
management shutting down the Ethernet interface PHY. I continue to have
problems with Mac OS X clients' power management shutting down the
wireless PHY, but have not had time to investigate. With Windows 7 it is
possible to disable the "Allow the computer to turn off this device to
save power" setting in the Power Management tab of the network adapter's
Properties. It depends on the NIC driver as to whether or not this is
needed. Some drivers report that they handle various sleep states when
they in fact do not, or at least they do not return to D0 state in a
timely manner.
We do not back up client computers, only servers. I sincerely hope that a Windows Server does not do Ethernet shutdowns or power management :-). But I am a Unix guy ...
Post by Josh Fisher
And finally, many switches also have TCP timeout settings and/or EEE and
power management that could potentially not work correctly with either
the FreeBSD or the Windows network stacks.
That sounds interesting. I saw a couple of posts talking about keepalives; I will configure our FDs, SDs and the Director with a 300-second keepalive interval and we will see if we still get those errors.

Maybe it has nothing to do with the switch to FreeBSD: at nearly the same time we migrated our servers from physical machines to VMware, so maybe it is the virtual VMware switch that is causing trouble.

Well, in either case, I'll see how it goes with the keepalive configured.
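
For readers unfamiliar with what a keepalive does at the socket level, here is a minimal sketch. Bacula is configured through its own directives rather than code like this (its documented `Heartbeat Interval` directive is presumably the setting meant here, which is an assumption). The function name `enable_keepalive` and the default values are hypothetical; 300 seconds is taken from the post above, while the probe interval and count are arbitrary. Option names vary by operating system, so only the ones the local Python build exposes are set.

```python
import socket


def enable_keepalive(sock, idle_seconds=300, interval_seconds=30, probes=5):
    """Turn on TCP keepalive and tune its timers where the platform allows it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = {
        "TCP_KEEPIDLE": idle_seconds,       # idle time before the first probe
        "TCP_KEEPINTVL": interval_seconds,  # gap between successive probes
        "TCP_KEEPCNT": probes,              # unanswered probes before giving up
    }
    for name, value in tuning.items():
        option = getattr(socket, name, None)
        if option is None:
            continue  # this platform does not expose the option
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # exposed by Python but rejected by the OS; keep the default
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
    print("keepalive on:", bool(enabled))
    s.close()
```

With keepalive enabled, the kernel sends small probe segments on an otherwise idle connection, so a switch or firewall that expires idle flows keeps seeing traffic.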

Thank you
Best
Ray


------------------------------------------------------------------------------
Josh Fisher
2015-08-06 14:11:30 UTC
Post by Raimund Sacherer
Hello Josh, Bacula-users,
Post by Josh Fisher
I have seen this before as well, although not with FreeBSD. Bacula-dir
expects the TCP connection with the client to remain up throughout the
entire job. In my case I concluded that it was aggressive Windows power
management shutting down the Ethernet interface PHY. I continue to have
problems with Mac OS X clients' power management shutting down the
wireless PHY, but have not had time to investigate. With Windows 7 it is
possible to disable the "Allow the computer to turn off this device to
save power" setting in the Power Management tab of the network adapter's
Properties. It depends on the NIC driver as to whether or not this is
needed. Some drivers report that they handle various sleep states when
they in fact do not, or at least they do not return to D0 state in a
timely manner.
We do not back up client computers, only servers. I sincerely hope that a Windows Server does not do Ethernet shutdowns or power management :-). But I am a Unix guy ...
Yes, but server NICs supporting 802.3az are green independently of the
OS, other than the NIC driver sending a Low Power Idle request. The
firmware shuts down the PHY transmitter after a period of sending LPI
symbols, and the Dir-FD connection is idle for an extended time. What do
pre-2010 switches without 802.3az support do when those NICs shut down
their transmitters? If they are green enough to shut down the port then
it may well look like a dropped connection on one end or the other.
Post by Raimund Sacherer
Post by Josh Fisher
And finally, many switches also have TCP timeout settings and/or EEE and
power management that could potentially not work correctly with either
the FreeBSD or the Windows network stacks.
That sounds interesting. I saw a couple of posts talking about keepalives; I will configure our FDs, SDs and the Director with a 300-second keepalive interval and we will see if we still get those errors.
Maybe it has nothing to do with the switch to FreeBSD: at nearly the same time we migrated our servers from physical machines to VMware, so maybe it is the virtual VMware switch that is causing trouble.
Well, in either case, I'll see how it goes with the keepalive configured.
Let us know, please. I had mixed results with the keep alive, while
replacing an old switch seemed to magically fix Windows clients.
Post by Raimund Sacherer
Thank you
Best
Ray
------------------------------------------------------------------------------
Josh Fisher
2015-08-07 12:47:21 UTC
Post by Raimund Sacherer
Post by Josh Fisher
And finally, many switches also have TCP timeout settings and/or EEE and
power management that could potentially not work correctly with either
the FreeBSD or the Windows network stacks.
That sounds interesting. I saw a couple of posts talking about keepalives; I will configure our FDs, SDs and the Director with a 300-second keepalive interval and we will see if we still get those errors.
Maybe it has nothing to do with the switch to FreeBSD: at nearly the same time we migrated our servers from physical machines to VMware, so maybe it is the virtual VMware switch that is causing trouble.
Well, in either case, I'll see how it goes with the keepalive configured.
Fyi, version 7.x of the client daemon added progress data. The FD sends
progress data to the Dir every 30 seconds. In version 5.x the Dir - FD
connection sat idle during a backup. If your Windows FDs are 5.x then
that could explain why they fail on the same network where the other FDs
do not.
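
The reasoning above can be sketched as a toy model: a device on the path that forgets idle connections after some timeout will drop a control connection that stays silent for the whole job, but not one that carries traffic every 30 seconds. The two-hour idle timeout below is purely hypothetical (no such figure appears in this thread), and the function names are made up for illustration.

```python
def longest_silence(job_seconds, heartbeat_seconds=None):
    """Longest stretch with no traffic on the control connection during a job."""
    if heartbeat_seconds is None or heartbeat_seconds >= job_seconds:
        return job_seconds  # nothing is sent until the job ends
    return heartbeat_seconds


def survives(job_seconds, idle_timeout_seconds, heartbeat_seconds=None):
    """True if no silent stretch reaches the (hypothetical) idle timeout."""
    return longest_silence(job_seconds, heartbeat_seconds) < idle_timeout_seconds


JOB = 11 * 3600 + 5 * 60 + 22  # elapsed time of the failed job in this thread
IDLE_TIMEOUT = 2 * 3600        # hypothetical: idle flows are dropped after 2 hours

if __name__ == "__main__":
    print("5.x-style, silent connection:", survives(JOB, IDLE_TIMEOUT))
    print("7.x-style, traffic every 30 s:", survives(JOB, IDLE_TIMEOUT, 30))
```

Under these assumptions the silent connection does not survive the 11-hour job, while the one with 30-second progress updates does.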


------------------------------------------------------------------------------
Raimund Sacherer
2015-08-10 09:52:16 UTC
Post by Josh Fisher
Fyi, version 7.x of the client daemon added progress data. The FD sends
progress data to the Dir every 30 seconds. In version 5.x the Dir - FD
connection sat idle during a backup. If your Windows FDs are 5.x then
that could explain why they fail on the same network where the other FDs
do not.
Hello Josh,

Thank you very much for your very insightful input. I'll try to implement the keepalives this week, and we will upgrade our FDs on Windows to the latest 7.x.

I'll report back on how it goes, but it'll take some time.

Thanks,
Best
Ray

------------------------------------------------------------------------------