Description
What problem/issue/behaviour are you having trouble with? What do you expect to see?
Chrony Marking NTP servers as offline Upon reboot of the nodes.
We do not see anything in the message log or puppet logs
Customers’ Multiple Data centers/Sites got affected by this issue.
Due to the marking of NTP offline, the clock went out of sync, resulting in a loss of traffic and causing a nationwide outage.
Output from some of the affected nodes.
This is not limited to AlmaLinux 8; We are facing the same in CentOS 7 & CentOS 8
Currently, we have output from CentOS 7
Currently, we captured the data on CentOS, they are the affected nodes in customer’s setup
Note: In our implementation, we are not managing network interfaces with Network Manager and Looked into the following.
And it did not make any difference.
Another Note: Some nodes are affected in Multiple Data Centers, and the rest are fine.
Affected nodes, it’s reproducible upon reboot and where we have not seen this issue, even in the same env on different nodes.
it’s not happening over there.
[qns@allntx1pcc01v ~]$ uname -a
Linux allntx1pcc01v 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[qns@allntx1pcc01v ~]$ cat /etc/*-release
CentOS Linux release 7.6.1810 (Core)
NAME=“CentOS Linux”
VERSION=“7 (Core)”
ID=“centos”
ID_LIKE=“rhel fedora”
VERSION_ID=“7”
PRETTY_NAME=“CentOS Linux 7 (Core)”
ANSI_COLOR=“0;31”
CPE_NAME=“cpe:/o:centos:centos:7”
HOME_URL=“https://www.centos.org/”
BUG_REPORT_URL=“https://bugs.centos.org/”
CENTOS_MANTISBT_PROJECT=“CentOS-7”
CENTOS_MANTISBT_PROJECT_VERSION=“7”
REDHAT_SUPPORT_PRODUCT=“centos”
REDHAT_SUPPORT_PRODUCT_VERSION=“7”
CentOS Linux release 7.6.1810 (Core)
CentOS Linux release 7.6.1810 (Core)
[qns@allntx1pcc01v ~]$ rpm -qa | sort -i | grep -i ‘(chrony|ntp)’
chrony-3.2-2.el7.x86_64
fontpackages-filesystem-1.44-8.el7.noarch
ntp-4.2.6p5-28.el7.centos.x86_64
ntpdate-4.2.6p5-28.el7.centos.x86_64
[qns@allntx1pcc01v ~]$ ps auxwwwf | grep -i ‘(chrony|ntp)’
ntp 30958 0.0 0.0 25720 1896 ? Ss 2021 0:50 /usr/sbin/ntpd -u ntp:ntp -g
qns 3827 0.0 0.0 112708 1012 pts/0 S+ 19:30 0:00 _ grep --color=auto -i (chrony|ntp)
[qns@allntx1pcc01v ~]$ date
Thu Sep 1 19:30:04 UTC 2022
[qns@allntx1pcc01v ~]$ about.sh
Cisco Policy Suite - Copyright (c) 2015. All rights reserved.
CPS Multi-Node Environment
CPS Installer Version - 19.5.0
…-X-SNIP-X-…
[qns@allntx4pps02va ~]$ chronyc sources
210 Number of sources = 2
MS Name/IP address Stratum Poll Reach LastRx Last sample
^? lb01 2 10 0 44d +3581ms[+3581ms] +/- 16.0s
^? lb02 3 10 0 47d +46us[ +46us] +/- 15.9s
[qns@allntx4pps02va ~]$ chronyc tracking
Reference ID : AC1AF20C (lb02)
Stratum : 4
Ref time (UTC) : Thu Jun 09 10:20:05 2022
System time : 0.000000153 seconds slow of NTP time
Last offset : -0.000002489 seconds
RMS offset : 0.187102452 seconds
Frequency : 17.639 ppm slow
Residual freq : -0.000 ppm
Skew : 0.002 ppm
Root delay : 0.038741432 seconds
Root dispersion : 7.168707848 seconds
Update interval : 1031.8 seconds
Leap status : Normal
[qns@allntx4pps02va ~]$ chronyc sourcestats
210 Number of sources = 2
Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev
lb01 18 11 92m +6.918 0.022 +30.4s 42us
lb02 64 33 18h +0.000 0.000 +885us 18us
[qns@allntx4plb01va log]$ chronyc sources
210 Number of sources = 2
MS Name/IP address Stratum Poll Reach LastRx Last sample
^* akrldntp01.wnsnet.attws.> 1 6 0 50d -25us[ -161us] +/- 17ms
^- schldntp01.wnsnet.attws.> 2 6 0 50d +685us[ +685us] +/- 34ms
[qns@allntx4plb01va log]$ chronyc tracking
Reference ID : 9BAED650 (akrldntp01.wnsnet.attws.com)
Stratum : 2
Ref time (UTC) : Wed Jun 08 06:58:03 2022
System time : 0.000000028 seconds slow of NTP time
Last offset : -0.000135603 seconds
RMS offset : 0.000135603 seconds
Frequency : 15.377 ppm slow
Residual freq : -31.511 ppm
Skew : 0.026 ppm
Root delay : 0.030828953 seconds
Root dispersion : 142.074813843 seconds
Update interval : 2.1 seconds
Leap status : Normal
[qns@allntx4plb01va log]$ chronyc sourcestats
210 Number of sources = 2
Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev
akrldntp01.wnsnet.attws.> 4 4 6 -31.511 1011.791 -137.6s 149us
schldntp01.wnsnet.attws.> 4 4 6 -179.900 1376.575 -785.5s 152us
GOOD SITE
vnnyca2pcc01va
CPS Diagnostics HA Multi-Node Environment
Checking for clock skew for all VMs…[PASS]
[qns@vnnyca2plb01va ~]$ renice 19 -p $$
3901420 (process ID) old priority 0, new priority 19
[qns@vnnyca2plb01va ~]$ chronyc sources
210 Number of sources = 2
MS Name/IP address Stratum Poll Reach LastRx Last sample
^* bthldntp01.wnsnet.attws.> 1 10 377 444 +19us[ +83us] +/- 16ms
^- alnldntp01.wnsnet.attws.> 2 10 377 397 -678us[ -678us] +/- 33ms
[qns@vnnyca2plb01va ~]$ chronyc tracking
Reference ID : AC1B114F (bthldntp01.wnsnet.attws.com)
Stratum : 2
Ref time (UTC) : Thu Jul 28 19:50:29 2022
System time : 0.000032692 seconds fast of NTP time
Last offset : +0.000064008 seconds
RMS offset : 0.000065005 seconds
Frequency : 28.827 ppm slow
Residual freq : +0.025 ppm
Skew : 0.079 ppm
Root delay : 0.029123629 seconds
Root dispersion : 0.002041835 seconds
Update interval : 1030.3 seconds
Leap status : Normal
[qns@vnnyca2plb01va ~]$ chronyc sourcestats
210 Number of sources = 2
Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev
bthldntp01.wnsnet.attws.> 6 3 86m +0.025 0.089 +12us 50us
alnldntp01.wnsnet.attws.> 42 23 11h -0.011 0.026 -389us 573us
qns@gsvlfl4plb02va ~]$ chronyc sources
210 Number of sources = 2
MS Name/IP address Stratum Poll Reach LastRx Last sample
^+ bthldntp01.wnsnet.attws.> 2 6 0 5d +426us[ +426us] +/- 17ms
^* alnldntp01.wnsnet.attws.> 2 6 0 5d -474us[ +12us] +/- 55ms
[qns@gsvlfl4plb02va ~]$ chronyc activity
200 OK
0 sources online
2 sources offline
0 sources doing burst (return to online)
0 sources doing burst (return to offline)
0 sources with unknown address
[qns@gsvlfl4plb02va ~]$ chronyc tracking
Reference ID : AC1B0160 (alnldntp01.wnsnet.attws.com)
Stratum : 3
Ref time (UTC) : Fri Sep 09 07:42:20 2022
System time : 0.000000191 seconds slow of NTP time
Last offset : +0.000486426 seconds
RMS offset : 0.000486426 seconds
Frequency : 38.600 ppm slow
Residual freq : +99.831 ppm
Skew : 0.025 ppm
Root delay : 0.055775598 seconds
Root dispersion : 46.592983246 seconds
Update interval : 2.2 seconds
Leap status : Normal
[qns@gsvlfl4plb02va ~]$ uptime
15:57:43 up 5 days, 8:15, 1 user, load average: 4.80, 5.39, 5.62
[qns@gsvlfl4plb02va ~]$ date -d “5 days ago”
Fri Sep 9 15:57:55 UTC 2022
[qns@gsvlfl4plb02va ~]$ ps -aeo pid,comm,lstart | grep chronyd
920 chronyd Fri Sep 9 07:42:01 2022
What is the business impact? Please also provide timeframe information.
Loss of traffic, Nationwide outage, Business loss
Where are you experiencing the behavior? What environment?
Customer’s Implementation
When does the behavior occur? Frequency? Repeatedly? At certain times?
Fri Sep 9 15:57:55 UTC 2022