NTP "FAQ" #3 (10/97 - 4/98) - Articles (part 11)


From: Thomas Tornblom <thomast@dilbert.sun.se>
Date: 09 Mar 1998 14:18:18 +0100
Newsgroups: comp.protocols.time.ntp
Subject: Re: Nasty fast clock (Sol 2.5) - can it be tamed???
X-Keywords: nsec_per_tick, syslog

Gregory Bond <gnb@itga.com.au> writes:

> We have one of our Solaris 2.5.1 machines (a Ross SparcPlug cpu
> module) that has a shocking clock that runs about 1% (yes, that's
> 10,000ppm) fast. It's sync'd to a stratum-3 server that has
> good-enough-for-our-purposes time (under 1/2 sec is what I'm aiming
> for - I'm not too ambitious!)
>
> An extract from the syslog shows it's gaining time at a shocking rate
> and getting lotsa of step time adjustments. This is Ugly on a
> production system that is supposed to be providing timestamps to
> important transactions....
>
> I've turned off dosynctodr with no noticeable improvement. Anything
> else I can try (short of moving to a new machine)?

If this is a sun4m class machine, then you can tweak the kernel variable nsec_per_tick.

If you can calculate the drift close enough to get xntpd to lock, then you can observe ntp.drift and make minor adjustments until you're satisfied with the result.

We have an old ss10 which I checked this morning and the drift was at ~97ppm. After running:

echo 'nsec_per_tick/W 0t10000900' |adb -w -k

it is now at 7.641 ppm.

I have the above adb line in /etc/init.d/xntpd

I will add 70 to the value to bring it closer to 0.

Notice that this procedure doesn't work on sun4u machines.
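The adjustment described above can be sketched as a small calculation. This is only an illustration, not part of the original post: the function name is hypothetical, and the sign convention (a positive ntp.drift value is reduced by raising nsec_per_tick) is an assumption read off the ss10 example, where going from the nominal 10000000 to 10000900 moved the drift from about 97 ppm to 7.641 ppm.

```c
/*
 * Sketch: suggest a new nsec_per_tick from the current value and the
 * residual drift (in ppm) that xntpd reports in ntp.drift.
 *
 * Assumption: a positive drift shrinks when nsec_per_tick is raised,
 * as observed in the post above. The function name is hypothetical.
 */
long
suggest_nsec_per_tick(long current, double drift_ppm)
{
    double delta = (double)current * drift_ppm / 1e6;

    /* Round to the nearest whole nanosecond without needing libm. */
    return current + (long)(delta + (delta >= 0.0 ? 0.5 : -0.5));
}
```

For the ss10 above, suggest_nsec_per_tick(10000900, 7.641) returns 10000976, which is in the same range as the "add 70" correction mentioned in the post. The result would still be written with adb as shown earlier.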

Thomas --

Thomas Tornblom Tel: +46 8 623 9100 E-mail: Thomas.Tornblom@Sun.SE Sun Microsystems AB Fax: +46 8 623 9102

From: wiu09524@rrzc4 (Ulrich Windl)
Date: 10 Mar 1998 15:03:40 GMT
Newsgroups: comp.protocols.time.ntp
Subject: Re: Nasty fast clock (Sol 2.5) - can it be tamed???
X-Keywords: HP-UX

In article <bqhyaykt339.fsf@dilbert.sun.se> Thomas Tornblom <thomast@dilbert.sun.se> writes:

[...]

> echo 'nsec_per_tick/W 0t10000900' |adb -w -k

I wonder if anybody knows: HP-UX seems to have something similar, but it seems the value is taken out from non-volatile memory as the systems (9000/8xx) all have very small drift. It also seems that HP is presetting that NVRAM with the correct value. (I just think it could work like that; does anybody know details? HP-UX clock code is quite secret...)

[...]

Ulrich

From: Marc Brett <mbrett@rgs0.london.waii.com>
Date: 10 Mar 1998 09:51:44 GMT
Newsgroups: comp.protocols.time.ntp
Subject: Re: Sydney Daylight Saving Time
X-Keywords: daylight, DST

Hillman Chan <hchan@elaustg1.att.net.au> wrote:
> Hi all,
>
> I'm sorry if my question is not appropriate for this group.
>
> Can anybody tells me when Sydney starts/ends daylight saving time? The
> current setting of my NT box is:
>
> Start: Last Sunday of Oct at 02:00
> End: First Sunday of Mar at 02:00
>
> which is wrong. Any idea when can I get the correct time?
>
> Thx a lot!

From <http://www.ft.uni-erlangen.de/~mskuhn/iso-time.html>:

For those readers interested in more information about time zones:

Arthur David Olson maintains a database of all current and historic time zone changes. It is available via ftp from elsie.nci.nih.gov in the pub/tzcode* and pub/tzdata* files. Most Unix time zone handling implementations are based on this package. If you want to join the tz@elsie.nci.nih.gov mailing list, which is dedicated to discussions about time zones, please send a short message to tz-request@elsie.nci.nih.gov.

I checked my copy of the australasia file from this collection (Sep 1997) and it matches the above dates. There was a note that NSW was planning to move the DST start date forward for the 2000 Olympics. This may take some time to be negotiated, because the plan is to do this in multiple states due to soccer games (which are not just in Sydney). In Australia, time zones are not legislated by the federal government, but by the individual states. Perhaps NSW has switched already(?). Your local reference or law library may be able to check the current legislation on this matter.

The Microsoft web site has a time zone editor ("tzedit"?) which can be used to modify the default rules for W95.

-- Marc Brett +44 181 560 3160 Western Geophysical Marc.Brett@waii.com 455 London Road, Isleworth FAX: +44 181 847 5711 Middlesex TW7 5AB UK

From: add@netcom.com (Arthur Darren Dunham)
Date: Tue, 10 Mar 1998 19:11:41 GMT
Newsgroups: comp.protocols.time.ntp
Subject: Re: Dial-up costs doubled since using XNTPD (Solaris)
X-Keywords: configuration, reset

In article <6e3uim$6gv$1@ezekiel.eunet.ie>, Noel O'Sullivan <noel@beacon.ie> wrote:
>I use a Solaris 2.5 machine to dial-up our ISP on demand via SunISDN
>v.1.0.4. This used to be up about 7-8 hours per day. Since using xntpd this
>has increased to 16-17 hours per day. The ntp.log file indicates that xntp
>is synchronising late at night.
>
>Is there any way to get it not to synchronise unless a dial-up connection is
>already established?

I don't know anything about SunISDN, but most ISDN bridges and such can configure filters for packets that should not bring up the line. If this is available, far and away the best solution is to leave the NTP configuration alone, and mark UDP/123 traffic to not reset the idle timer and to not bring up the line.

When legitimate other traffic brings up the line, the ntp daemon will just start syncing right up.

If you do not have this facility then you're going to have to do something else. I think a more elegant solution would be to set up keys so that the daemon can be configured. After a particular hour, have a cron job remove all external servers from the daemon. In the morning, add them back.

A more limiting step would be to kill and restart the daemon every day, but you'll have your clock running free after the daemon is killed.

>Noel

--
Darren Dunham ddunham@taos.com
Unix System Administrator Taos Mountain
Got some Dr. Pepper? Santa Clara, CA
< Please move on, ...nothing to see here, please disperse >

From: Daniel Michel <micheld1@club-internet.fr>
Date: Wed, 11 Mar 1998 00:19:50 -0500
Newsgroups: comp.protocols.time.ntp
Subject: Re: ntpq
X-Keywords: broadcast, delay, dispersion, documentation, manpage, multicast, peer

You're right, I also believe it should be MILLIseconds!

To convince yourself, just run 'date' on the two machines simultaneously, and you will be able to judge the accuracy of the synchronization. (If the accuracy is a few milliseconds, you should not see any difference with 'date'.)

Daniel MICHEL

Charles M. Coldwell wrote:

> I noticed what appears to be an error in the documentation for ntpq.
> The section of the manpage describing the peers command reads as
>
> peers
>     Obtains a list of in-spec peers of the server, along with a summary
>     of each peer's state. Summary information includes the address of
>     the remote peer, the reference ID (0.0.0.0 if the refID is unknown),
>     the stratum of the remote peer, the type of the peer (local, unicast,
>     multicast or broadcast), when the last packet was received, the
>     polling interval, in seconds, the reachability register, in octal, and
>     the current estimated delay, offset and dispersion of the peer, all in
>     seconds.
>
> I think the last word above should be MILLIseconds.
>
> --
>
> Charles M. Coldwell
> Harvard-Smithsonian Center for Astrophysics
> 60 Garden St. MS #20
> Cambridge, MA 02138
> (617) 495-7491

From: Klaus Kusche <Klaus.Kusche@ooe.gv.at>
Date: Wed, 11 Mar 1998 11:17:04 +0100
Newsgroups: comp.protocols.time.ntp,comp.unix.aix
Subject: Re: Setting up xntp for a few AIX-machines?
X-Keywords: adjustment, AIX, PARSE, syslog

Richard Siggins wrote: > Anyone had any experience with XNTP 3.5.92 on AIX 4.1?

Yes, for me it worked more or less out of the box, syncing 6 local AIX 4.1.5 machines with no connection to any external time servers (I successfully used 3.5.90 before, too).

Problems I had so far:

* AIX 4.1.5 has a wrong entry for ntp in /etc/services: "123/tcp" instead of "123/udp".

* The memory problem already discussed in this newsgroup: When starting, xntpd locks its memory (makes it permanently resident, unpageable). On AIX, this locks the maximum allowed amount for the stack (64 MB in my case!). Use "ulimit -s 128" before starting xntpd!

* A lot of "tickadj" fiddling is required to allow smooth clock adjustment: I never got it working without tickadj, half of my machines work fine with the default "tickadj -A" value, others required several attempts until clock adjustment worked (for example, I have a RS/6000-F40 which has a hardware clock off by about 400 ppm, and after trying several values, I finally got it working using "tickadj -a 200" - both lower and higher values failed).

Yesterday evening, I got our new radio clock working, but up to now I was unable to find a suitable tickadj value for the machine it is connected to - the machine steps its clock every 20 minutes. Any help and insight welcome!

* On every start of xntpd, two errors are written into the syslog:

    PARSE receiver #0: stream_init: ioctl(fd,I_PUSH, "parse"): Invalid argument
    PARSE receiver #0: ppsclock_init: ioctl(fd, I_PUSH, "ppsclock"): Invalid argument

  However, xntpd seems to work fine in spite of the errors?!?

-- DI. Dr. Klaus Kusche Oberoesterreichische Landesregierung / Government of Upper Austria Rechenzentrum / Computing Centre Smail: Kaerntnerstrasse 16, A-4020 Linz, Austria (Europe) Phone: +43 732 7720 - 3394 Fax: +43 732 7720 - 3198 Email: Klaus.Kusche@ooe.gv.at

From: bwb@etl.nospam.gov (Bruce Bartram 303-497-6217)
Date: Wed, 11 Mar 1998 12:34:01 -0700
Newsgroups: comp.protocols.time.ntp
Subject: Trimming Sun Ultra clocks

----- April 97 comp.protocols.time.ntp thread on Sun ultraSparc trimming

From: sean@ugcs.caltech.edu (M. Sean Bennett)
Subject: Re: Solaris 2.5.1 ntp source
Date: 22 Apr 1997 16:48:34 GMT
Organization: California Institute of Technology, Pasadena, CA USA
X-Keywords: cpu_tick_freq

Please note that all ultra systems must run 2.5/2.5.1

If anyone has a good solution to this please E-mail me.

> Bug Id: 4023118
> Category: kernel
> Subcategory: other
> State: integrated
> Synopsis: sun4u doesn't keep accurate time
> Description:
>
> [ bmc, 12/20/96 ]
>
> The clock on a sun4u drifts unacceptably. On a typical 143 MHz Ultra,
> the clock took 1.0001350 seconds to count 1 second. While this may seem
> trivial, it adds up quickly. In this case, the TOD chip will have to
> pull the clock forward by 2 seconds every 4 hours and 7 minutes.
> This drift rate is so high, that the clock is close to being too broken
> for NTP to guarantee correctness (in order for NTP's mechanism to work,
> it must be assured that the local clock drifts no more than 20 ms in 64
> seconds; this particular 143 MHz Ultra will drift by nearly 9 ms in that
> period). This problem has been reproduced on virtually all sun4u
> classes.
>
> The fundamental problem lies in the kernel's perception of ticks per
> second. The PROM is responsible for determining this figure exactly,
> and the kernel extracts it into the variable cpu_tick_freq. On sun4u's,
> this number is disconcertingly round: 143000000, 167000000, 248000000,
> etc. Indeed, a simple experiment revealed that these numbers were
> quite far from the actual ticks per second. Typical was the 143 MHz
> Ultra which was discovered to tick around 142,980,806 (+/- 10) times
> per second.
>
> Work around:
>
> Integrated in releases: s297_27
> Duplicate of:
> Patch id:
> See also:
> Summary:
>
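The figures in the bug report can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the inputs are the numbers quoted in the report (1.0001350 seconds per second, a 2 second TOD correction, a 64 second interval, and the nominal versus measured tick counts).

```c
/*
 * Sketch: arithmetic behind the drift figures quoted in the bug report.
 * All function names are hypothetical helpers, not Solaris interfaces.
 */

/* Drift in ppm when the clock takes `rate` seconds to count one second. */
double
rate_to_ppm(double rate)
{
    return (rate - 1.0) * 1e6;
}

/* Seconds needed to accumulate `offset_s` seconds of error at `ppm`. */
double
seconds_to_drift(double ppm, double offset_s)
{
    return offset_s / (ppm * 1e-6);
}

/* Milliseconds of error accumulated over `interval_s` seconds at `ppm`. */
double
drift_in_interval_ms(double ppm, double interval_s)
{
    return ppm * 1e-6 * interval_s * 1000.0;
}

/* Frequency error in ppm of a nominal tick rate against a measured one. */
double
freq_error_ppm(double nominal_hz, double measured_hz)
{
    return (nominal_hz - measured_hz) / measured_hz * 1e6;
}
```

With the report's numbers this gives 135 ppm, roughly 14,815 seconds (about 4 hours 7 minutes) to accumulate 2 seconds, and about 8.64 ms of drift per 64 seconds. The tick counts (143000000 nominal against 142,980,806 measured) work out to roughly 134 ppm, consistent with the other figures.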

From: bmc@kiowa.eng.sun.com (Bryan Cantrill)
Subject: Re: Solaris 2.5.1 ntp source
Date: 23 Apr 1997 11:51:45 GMT
Organization: Sun Microsystems Computer Corporation
X-Keywords: bug, dosynctodr, nsec_per_tick

In article <5jiq52$5ah@gap.cco.caltech.edu>, M. Sean Bennett <sean@ugcs.caltech.edu> wrote:
[ snip of the material quoted by bmc listed above ] ...

I'm all for open discussion of this problem, but just out of curiosity, where did you get the above bug report? It's from our internal database, and I was unaware that it was released to customers who aren't under NDA. In any case, I'm the "bmc" in the above report; it's important to note that I fixed the bug in 2.6. And, if there's a demand for it, I'll be happy to patch the fix back to 2.5 and 2.5.1.

In terms of a fix, if you use NTP with the appropriate hacks (e.g. dosynctodr set to 0), you _shouldn't_ have any problems. If you don't use NTP but still need more accurate time than 2.5 and 2.5.1 on sun4u provide, you could also look into setting nsec_per_tick to be accurate for your particular box. Let me know if you need more information about how to go about doing this...

- Bryan

---------------------------------------------------------------------- Bryan Cantrill, Solaris Performance. bmc@eng.sun.com (415) 786-3652

Subject: Re: Solaris 2.5.1 ntp source
X-Keywords: nsec_per_tick

Bryan Cantrill (bmc@kiowa.eng.sun.com) wrote:
[SNIP] [snip of quoted material listed above]...

Howdy, I think that having nsec_per_tick working on Ultra's in 2.6 is GREAT !!!! Releasing patches for 2.5* would be very nice, too.

What about x86 versions ? (Inquiring minds are annoying....)

Thanks very much for the real info.

Bruce Bartram bbartram@etl.nospam.gov just another chimehead followup and email

From: bmc@kiowa.eng.sun.com (Bryan Cantrill)
Subject: Re: Solaris 2.5.1 ntp source
Date: 25 Apr 1997 12:46:28 GMT
Organization: Sun Microsystems Computer Corporation
X-Keywords: adjustment, bug, cpu_tick_freq, dosynctodr, implementation, nsec_per_tick, release

In article <5jojml$fk7@lilypad.rutgers.edu>, Rick Thomas <rbthomas@lilypad.rutgers.edu> wrote:
>Bryan,
>
>Thanks! for letting us know that this longstanding bug has been fixed
>in Solaris 2.6 - we all look forward to the official release!
>
>Since there are lots of people who won't be able to install 2.6
>immediately, I'd say there is considerable demand for a back-port
>version of the fix that works with 2.5 and 2.5.1. Please do it!

I'll put it on the to-do list...because the fix is extremely architecture specific, it probably won't be until after 2.6 ships. However, I certainly appreciate the concern over sun4u's inaccurate time (I was dismayed too, if it's any consolation).

>I'm concerned, however when you suggest as a work-around, patching
>"nsec_per_tick". That strategy for trimming the Solaris clock does
>_not_ work on Ultras, as is well known in this newsgroup. (It works
>fine on Sun4c's and 4m's, just not on Ultras.) Until your back-ported
>fix arrives for 4u's running 2.5 and 2.5.1, do you have a work-around
>that _does_ work on Ultras?

I realized almost immediately after posting this that I shouldn't have even mentioned nsec_per_tick; it won't help the cause on sun4u (but then, you know that). Here's why: the V9 architecture, like other modern architectures, maintains a register which counts clock ticks (%tick). The UltraSPARC implementation of V9 also adds a %tick compare register, TICK_CMPR_REG. This allows system software to request to be interrupted when %tick equals some arbitrary value.

On all of our other Sun architectures (sun4c, sun4d, sun4m), we use the Counter/Timer to post the clock interrupt. The Counter/Timer actually deals in _time_ (i.e. microseconds), so when you're in the clock interrupt handler, you know that nsec_per_tick's nanoseconds have elapsed since the last clock interrupt. That's why patching nsec_per_tick has enabled you to compensate for drifting Counter/Timers.

On sun4u, this isn't quite so simple: the number of nanoseconds which have elapsed since the last tick is

    (this_tick - lasttick) * (NANOSEC / cpu_tick_freq)

Point is: we don't multiply by some easy-to-patch scalar.
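To make the sun4u formula concrete, here is a minimal sketch of the same conversion in user-level C. It is not the kernel's code: the function name is hypothetical, and floating point is used only to keep the example short (how the kernel actually performs the arithmetic is not stated in the post).

```c
#include <stdint.h>

#ifndef NANOSEC
#define NANOSEC 1000000000LL    /* nanoseconds per second */
#endif

/*
 * Sketch: nanoseconds credited for a span of %tick counts, given the
 * frequency the kernel believes the CPU runs at (cpu_tick_freq).
 * Hypothetical helper; double arithmetic is an illustration only.
 */
int64_t
ticks_to_nsec(uint64_t this_tick, uint64_t last_tick, uint64_t cpu_tick_freq)
{
    uint64_t delta = this_tick - last_tick;

    return (int64_t)((double)delta * (double)NANOSEC / (double)cpu_tick_freq);
}
```

If a CPU really ticks 142,980,806 times per second but cpu_tick_freq holds 143000000, one real second is credited as about 999,865,776 ns, so the clock falls behind by roughly 134 microseconds every second. That is why the fix has to correct cpu_tick_freq itself rather than nsec_per_tick.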

So...

If you want to get accurate time on sun4u, there are two steps:

1. Determine the actual clock frequency of your CPU.
2. Patch your kernel's cpu_tick_freq variable, and patch the code which overwrites the cpu_tick_freq variable with the CPU's clock-frequency property.

When I originally constructed this, I had intended to describe how to do both. I'm realizing, however, that I better post a script to do the actual patching of the kernel. I'll post that tomorrow; this post will only cover how to get an approximation of your CPU's clock frequency (it'll take you a day to get the frequency, anyway).

The easiest way to do this is to observe when the time-of-day (TOD) chip steps in to pull time one direction or another (it will do this when the system's time has drifted by 2 seconds). In order for this approach to work, you obviously need to make sure that NTP is off, dosynctodr is 1, etc. I've hacked together a little program which I find useful for gathering this kind of data. Here's the program:

% cat > gettime.c
#include <signal.h>
#include <time.h>
#include <errno.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* strcpy(), strlen() */
#include <unistd.h>     /* getopt(), lseek(), write(), pause() */
#include <sys/wait.h>
#include <sys/priocntl.h>
#include <sys/rtpriocntl.h>
#include <fcntl.h>

#define DEFAULT_OUTPUT "/tmp/gettime.data"

#define _POSIX_PER_PROCESS_TIMER_SOURCE
#define WEEKSEC 604800

/* One sample per second, for up to a week. */
struct timeval tp[WEEKSEC];
hrtime_t before[WEEKSEC], after[WEEKSEC];

int sec = 0;
timer_t timer;

#define WORK_PRIO 35

int output_fd = 1;

/* Move this LWP into the realtime scheduling class. */
void
make_realtime()
{
    long real_result;
    pcinfo_t priority_info;
    pcparms_t priority_settings;
    rtparms_t *rtp;

    strcpy(priority_info.pc_clname, "RT");
    if (priocntl(P_LWPID, P_MYID, PC_GETCID,
        (caddr_t) &priority_info) == -1) {
        perror("priocntl");
        exit(EXIT_FAILURE);
    }

    priority_settings.pc_cid = priority_info.pc_cid;
    rtp = (rtparms_t *) priority_settings.pc_clparms;
    rtp->rt_pri = WORK_PRIO;
    rtp->rt_tqsecs = 2;
    rtp->rt_tqnsecs = 0;
    if (priocntl(P_LWPID, P_MYID, PC_SETPARMS,
        (caddr_t) &priority_settings) == -1) {
        perror("priocntl");
        exit(EXIT_FAILURE);
    }
}

/* SIGALRM handler: bracket gettimeofday() with two gethrtime() calls. */
void
every_sec(int ignore)
{
    before[sec] = gethrtime();
    gettimeofday(&tp[sec], NULL);
    after[sec] = gethrtime();
    sec++;
    signal(SIGALRM, every_sec);
}

/* SIGUSR1 handler: dump everything collected so far. */
void
dump_gnuplot(int ignore)
{
    int cursec = sec, i;
    hrtime_t lhs, rhs;
    char c[256];

    lseek(output_fd, 0, SEEK_SET);
    for (i = 1; i < cursec; i++) {
        /*
         * As posted: five arguments but only four conversions; the
         * follow-up later in this thread adds a final %lld.
         */
        sprintf(c, "%d %lld %lld %lld ", i,
            lhs = (before[i] - before[0]) / (hrtime_t) 1000,
            rhs = (hrtime_t) (tp[i].tv_sec - tp[0].tv_sec) *
            (hrtime_t) MICROSEC + (hrtime_t) (tp[i].tv_usec -
            tp[0].tv_usec), lhs - rhs,
            (after[i] - before[i]) / (hrtime_t) 1000);
        write(output_fd, c, strlen(c));
        sprintf(c, "\n");
        write(output_fd, c, strlen(c));
    }
    signal(SIGUSR1, dump_gnuplot);
}

main(int argc, char *argv[])
{
    struct sigevent ev;
    struct itimerspec value;
    int c;

    while ((c = getopt(argc, argv, "hroO:")) != EOF)
        switch (c) {
        case 'r':
            printf("gettime: running as realtime...\n");
            make_realtime();
            break;
        case 'O':
        case 'o': {
            char *file = c == 'o' ? DEFAULT_OUTPUT : optarg;
            printf("gettime: dumping to %s...\n", file);
            if ((output_fd = open(file,
                O_WRONLY|O_CREAT|O_TRUNC, 0666)) == -1) {
                perror("open");
                exit(1);
            }
            break;
        }
        case 'h':
        case '?':
            printf("\nUsage: gettime [-r] [-g] [-o|[-O file]]\n");
            printf("  -r        run in realtime class (must be root)\n");
            printf("  -o        dump output to %s\n", DEFAULT_OUTPUT);
            printf("  -O file   dump output to file\n");
            printf("\nNote: data is dumped upon receipt of SIGUSR1\n");
            exit(0);
        }

    signal(SIGALRM, every_sec);
    signal(SIGUSR1, dump_gnuplot);

    /* Ask for a SIGALRM once a second. */
    ev.sigev_notify = SIGEV_SIGNAL;
    ev.sigev_signo = SIGALRM;
    if (timer_create(CLOCK_REALTIME, &ev, &timer) != 0) {
        perror("timer_create");
        exit(1);
    }
    value.it_interval.tv_sec = 1;
    value.it_interval.tv_nsec = 0;
    value.it_value.tv_sec = 1;
    value.it_value.tv_nsec = 0;

    if (timer_settime(timer, TIMER_RELTIME, &value, NULL) != 0) {
        perror("timer_settime");
        exit(1);
    }

    for (;;)
        pause();
}
^D
% cc -o gettime gettime.c -lposix4
%

This program asks for a signal to be dropped on it once a second and in the signal handler it calls gettimeofday(3C) (which is subject to adjustments) surrounded by calls to gethrtime(3C) (which is not). So here's the kind of data this will generate:

% su
Password:
# ./gettime -h

Usage: gettime [-r] [-g] [-o|[-O file]]
  -r        run in realtime class (must be root)
  -o        dump output to /tmp/gettime.data
  -O file   dump output to file

Note: data is dumped upon receipt of SIGUSR1
# ./gettime &
19353
# sleep 10
# kill -USR1 19353
1 1000522 1000490 32
2 2000537 2000505 32
3 3000517 3000485 32
4 4000533 4000502 31
5 5000487 5000456 31
6 6000511 6000480 31
7 7000479 7000447 32
8 8000585 8000553 32
9 9000511 9000479 32
10 10011339 10011308 31
11 11000499 11000466 33
12 12000514 12000483 31
13 13000506 13000474 32
14 14000507 14000473 34
15 15000551 15000520 31

Going from left to right, we have the number of wall seconds elapsed, the gethrtime() in microseconds, the gettimeofday() in microseconds, and the number of nanoseconds between the straddling gethrtime()'s (also, note that dropping the SIGUSR1 on the program doesn't kill it).

Incidentally, I typically start gettime with the "-ro" options to be realtime, and to dump to /tmp/gettime.data. So, run gettime in the background, and drop a USR1 on it every couple of hours. At some point in the data, you'll see something like:

...
9324 9324003485 9324002985 500 6
9325 9325003468 9325002968 500 7
9326 9326003530 9326003029 501 6
9327 9327003581 9327003080 501 6
9328 9328003588 9328003086 502 6
9329 9329003568 9329003067 501 6
9330 9330003713 9329945059 58654 8
9331 9331063624 9330938724 124900 6
9332 9332133742 9331941961 191781 7
9333 9333193746 9332935714 258032 6
9334 9334263764 9333938856 324908 6
9335 9335333710 9334941931 391779 7
9336 9336393819 9335935784 458035 8
9337 9337463803 9336938893 524910 7
9338 9338533834 9337942046 591788 6
9339 9339593892 9338935852 658040 8
9340 9340663778 9339938866 724912 1
9341 9341733886 9340942096 791790 7
9342 9342793875 9341935833 858042 4
9343 9343864104 9342939177 924927 9
9344 9344933916 9343942122 991794 4
9345 9345994006 9344935959 1058047 8
9346 9347064111 9345939183 1124928 8
9347 9348133998 9346942201 1191797 7
9348 9349194028 9347935979 1258049 6
9349 9350264128 9348939198 1324930 7
9350 9351334121 9349942317 1391804 6
9351 9352394144 9350936088 1458056 7
9352 9353464154 9351939222 1524932 7
9353 9354534209 9352942399 1591810 7
9354 9355594197 9353936136 1658061 5
9355 9356664254 9354939317 1724937 8
9356 9357734356 9355942719 1791637 201
9357 9358794273 9356936208 1858065 6
9358 9359864305 9357939363 1924942 5
9359 9360934309 9358942493 1991816 7
9360 9361994393 9359993891 2000502 7
9361 9363004430 9361003929 2000501 9
9362 9364004458 9362003956 2000502 7
9363 9365004385 9363003882 2000503 6
9364 9366004467 9364003964 2000503 6
9365 9367004482 9365003979 2000503 6
9366 9368004573 9366004071 2000502 8
...

We can clearly see that on the 9330th second, the system time began an adjustment period which had finished by the 9360th second. Seeing this once is no good; you need to see it twice to determine clock frequency. More data, from later on the same run:

...
20302 20304005933 20302004869 2001064 6
20303 20305005910 20303004845 2001065 6
20304 20306005991 20303946154 2059837 7
20305 20307065864 20304939781 2126083 1
20306 20308135985 20305943022 2192963 7
20307 20309196015 20306936800 2259215 5
20308 20310266016 20307939927 2326089 8
20309 20311335989 20308943026 2392963 6
20310 20312396075 20309936858 2459217 7
20311 20313466064 20310939971 2526093 5
20312 20314536155 20311943182 2592973 6
20313 20315596143 20312936921 2659222 6
20314 20316666205 20313940104 2726101 8
20315 20317736156 20314943184 2792972 7
20316 20318796197 20315936971 2859226 6
20317 20319866196 20316940095 2926101 7
20318 20320936255 20317943276 2992979 7
20319 20321996213 20318936987 3059226 7
20320 20323066320 20319940212 3126108 6
20321 20324136326 20320943342 3192984 6
20322 20325196368 20321937132 3259236 8
20323 20326266374 20322940263 3326111 7
20324 20327336353 20323943368 3392985 7
20325 20328396328 20324937095 3459233 7
20326 20329466358 20325940246 3526112 5
20327 20330536335 20326943349 3592986 4
20328 20331596351 20327937115 3659236 6
20329 20332666447 20328940331 3726116 7
20330 20333736400 20329943410 3792990 5
20331 20334796406 20330937167 3859239 6
20332 20335866486 20331940368 3926118 8
20333 20336936414 20332943424 3992990 5
20334 20337996458 20333995392 4001066 6
20335 20339006486 20335005421 4001065 6
20336 20340006535 20336005468 4001067 7
...

So on the 20304th second, an adjustment period began which was over by the 20334th second; we're gaining 2 seconds every 10,974 seconds (that's about every 3 hours!). The advertised clock rate on this machine is 248 MHz, but, based on these numbers, it appears that the actual clock rate is more like 248.045198 MHz.

Incidentally, as you can infer from the code, I like to generate gnuplot graphs from timing data. If you use gnuplot, here's a .gpl file to graph the gettime data:

% cat > gettime.gpl
set term postscript portrait "Helvetica" 12
set size 1, 1
set clip one
set border
set xlabel "Seconds elapsed"
set ylabel "Milliseconds gettimeofday() behind gethrtime()"

set nokey
set output "/tmp/gettime.ps"
plot "/tmp/gettime.data" using 1:($2/1000) "%lf%*f%*f%lf" with lines 1
^D
% gnuplot gettime.gpl
% ls -al /tmp/gettime.ps
-rw-rw-r-- 1 bmc staff 11136 Apr 25 05:07 /tmp/gettime.ps

As a sidenote, I find these graphs very interesting to examine when using NTP.

It will probably take you six to seven hours to amass enough data to get a handle on the actual frequency of your clock. Once you've done that, it's time to patch the kernel. As I've mentioned above, the patching is a little too hairy to describe, so I'll write a script tomorrow which does it for you (and does the error checking to assure that you don't hurt yourself).
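The frequency estimate quoted above (248.045198 MHz) follows directly from the two adjustment onsets in the data. The sketch below shows that arithmetic; the function name and parameter names are hypothetical, and it assumes the system clock runs fast, so that each TOD correction represents step_s seconds gained over the interval between onsets.

```c
/*
 * Sketch: estimate the true CPU tick frequency from two successive
 * TOD adjustment onsets seen in the gettime output.
 *
 * Assumptions: the system clock gains `step_s` seconds between the two
 * onsets, and the onset times are close enough to real seconds for a
 * first approximation. The function name is hypothetical.
 */
double
estimate_cpu_freq_hz(double nominal_hz, long first_onset_s,
    long second_onset_s, double step_s)
{
    double interval_s = (double)(second_onset_s - first_onset_s);

    /* Gaining step_s per interval means the CPU ticks that much faster. */
    return nominal_hz * (1.0 + step_s / interval_s);
}
```

For the run shown here, estimate_cpu_freq_hz(248000000.0, 9330, 20304, 2.0) gives about 248,045,198 Hz, matching the figure in the post. A clock that loses time would be handled by passing a negative step_s.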

Sorry this process is such a whopping pain in the butt. It _is_ fixed in 2.6, and I _will_ patch it back. Also, I'm putting back a fix to assure that cpu_tick_freq won't be plowed if it's been patched or set in /etc/system (this will restore the ability for you to hack the kernel's perception of time).

More tomorrow, Bryan

---------------------------------------------------------------------- Bryan Cantrill, Solaris Performance. bmc@eng.sun.com (415) 786-3652

From: michael shiplett <walrus@fuseki.aa.ans.net>
Subject: Re: Solaris 2.5.1 ntp source
Date: 25 Apr 1997 21:10:23 -0400
Organization: ANS

bmc@kiowa.eng.sun.com (Bryan Cantrill) writes:

Thanks for the very helpful explanation, code, and promise of the kernel patch script. I did run into an error in gettime.c and the explanation of the fields.

> void
> dump_gnuplot(int ignore)
> {
[...]
>     for (i = 1; i < cursec; i++) {
>         sprintf(c, "%d %lld %lld %lld ", i,
>             lhs = (before[i] - before[0]) / (hrtime_t) 1000,
>             rhs = (hrtime_t) (tp[i].tv_sec - tp[0].tv_sec) *
>             (hrtime_t) MICROSEC + (hrtime_t) (tp[i].tv_usec -
>             tp[0].tv_usec), lhs - rhs,
>             (after[i] - before[i]) / (hrtime_t) 1000);

sprintf() has 5 arguments but only 4 format operators. It looks like the format string should be

sprintf(c, "%d %lld %lld %lld %lld", i,

This omission had me confused when I read your explanation of the output, as the first output had 4 columns, but the useful output (for how to locate TOD clock actions) had 5 columns :)

> Going from left to right, we have the number of wall seconds
> elapsed, the gethrtime() in microseconds, the gettimeofday() in
> microseconds, and the number of nanoseconds between the straddling
> gethrtime()'s

With the correct version you end up with the same output except the penultimate column (going left to right) is the difference between elapsed gettimeofday() and elapsed gethrtime() in microseconds. Of course, one gets this output even without the format fix, but its explanation is in the code.

One thing I've noticed is every so often I end up with a large (> 15 microseconds) difference between the two gethrtime() calls. Sometimes the gettimeofday() result jumps forward and back at this time as well. This is on an afs client but with the `-nosettime' option and no time daemons running.

255 255014940 255014928 12 3
256 256014925 256014912 13 4
257 257014881 257015022 -141 157
258 258014868 258014856 12 3
259 259014844 259014832 12 4
...
1021 1021009233 1021009221 12 3
1022 1022009216 1022009204 12 4
1023 1023009204 1023009192 12 153
1024 1024009166 1024009154 12 3
1025 1025009158 1025009146 12 4
...
1279 1279013832 1279013820 12 4
1280 1280013797 1280013785 12 4
1281 1281013774 1281013916 -142 159
1282 1282013757 1282013745 12 3
1283 1283013726 1283013714 12 4

Anyway, thanks again for the code, and I look forward to the kernel patch.

michael

From: michael shiplett <walrus@fuseki.aa.ans.net>
Subject: Re: Solaris 2.5.1 ntp source
Date: 25 Apr 1997 23:33:09 -0400
Organization: ANS
Message-ID: <xm7pvvisbka.fsf@fuseki.aa.ans.net>
References: <01bc4b51$af52bac0$9357b89d@sophie.noc.lexmark.com>
    <5jiq52$5ah@gap.cco.caltech.edu> <5jkt4h$ruq@engnews2.Eng.Sun.COM>
    <5jojml$fk7@lilypad.rutgers.edu> <5jq934$5s5@engnews2.Eng.Sun.COM>
    <xm7rafysi68.fsf@fuseki.aa.ans.net>

I, michael shiplett <walrus@fuseki.aa.ans.net> writes:

> One thing I've noticed is every so often I end up with
> a large (> 15 microseconds) difference between the two gethrtime()
> calls. Sometimes the gettimeofday() result jumps forward and back at
> this time as well. This is on an afs client but with the `-nosettime'
> option and no time daemons running.

There seem to be two regular, large periodic fluctuations in the difference between the two gethrtime() calls:

+ one occurs with a significant negative gethrtime() - gettimeofday() difference. The magnitude has ranged from 141 to 165. This started after 257 seconds and continued every 1024 seconds thereafter.

+ the other has no significant changes in the timer/time of day difference. This one started after 1024 seconds and also continued every 1024 thereafter.

This occurs every time I restart `gettime'.

`xntpd' and then `msntp' (with -a, the adjtime option) have run on this machine since reboot; neither is currently running.

michael

From: bmc@kiowa.eng.sun.com (Bryan Cantrill)
Subject: Re: Solaris 2.5.1 ntp source
Date: 26 Apr 1997 11:54:37 GMT
Organization: Sun Microsystems Computer Corporation
X-Keywords: cpu_tick_freq, delay

In article <xm7pvvisbka.fsf@fuseki.aa.ans.net>, michael shiplett <walrus@fuseki.aa.ans.net> wrote:
>There seem to be two regular, large periodic fluctuations in the
>difference between the two gethrtime() calls:
> + one occurs with a significant negative gethrtime() -
>   gettimeofday() difference. The magnitude has ranged from 141 to
>   165. This started after 257 seconds and continued every 1024
>   seconds thereafter.
>
> + the other has no significant changes in the timer/time of day
>   difference. This one started after 1024 seconds and also
>   continued every 1024 thereafter.

:)

1024 * sizeof(struct timeval) = 8192
1024 * sizeof(hrtime_t) = 8192
pagesize on sun4u = 8192

What you're seeing is the effect of faulting on the tp[] and before[] arrays. When you fault on the store to the before[] array, you see the long delay between the gethrtime() and the gettimeofday(). When you fault on the store to the tp[] array, you see the long delay after the gettimeofday(), but before the second gethrtime(). Faulting on the store to the after[] array doesn't affect anything because the timestamp has already been retrieved. I should have warned you that you would see this effect...if it bothers you, add the following lines to the end of make_realtime():

memset(tp, 0, sizeof(struct timeval) * WEEKSEC);
memset(before, 0, sizeof(hrtime_t) * WEEKSEC);
memset(after, 0, sizeof(hrtime_t) * WEEKSEC);

if (mlockall(MCL_CURRENT) != 0) {
        perror("mlockall");
        exit(EXIT_FAILURE);
}

This faults in all of your BSS, and then locks the translations down.
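The page arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `faulting_samples` and the starting byte offset of 6136 are assumptions, chosen so that the first page crossing lands on sample 257 as in the report; the real offset of each array within its page is not given in the thread.

```python
PAGE_SIZE = 8192  # sun4u page size quoted in the post


def faulting_samples(start_offset, elem_size, n_samples, page_size=PAGE_SIZE):
    """Return the sample indices whose store is the first touch of a page.

    start_offset is the (hypothetical) byte offset of the array's first
    element within its first page; one element is stored per sample.
    """
    faults = []
    last_page = None
    for i in range(n_samples):
        page = (start_offset + i * elem_size) // page_size
        if page != last_page:
            faults.append(i)
            last_page = page
    return faults


# 8-byte elements: a new page is touched every 8192 / 8 = 1024 samples.
print(faulting_samples(0, 8, 3000))     # page-aligned array
print(faulting_samples(6136, 8, 3000))  # assumed offset, first crossing at 257
```

With one sample per second, an array that starts partway into a page crosses its first boundary early and then every 1024 samples, which matches the "257, then every 1024 seconds" pattern.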

On to other business: first of all, mea culpa on the sprintf() typo. As you noticed, all of the output (and the explanation thereof) had been made with the extra arg. Secondly: I realized in the shower this morning that someone was going to note that it's just a hell of a lot easier (and more accurate) to let NTP determine the true frequency of your CPU. And dammit, Bill Sebok had exactly that observation. So mea culpa again (and thanks for pointing that out, Bill).

Finally: here's the script to patch your kernel. The comment contains its own warning (and apologies for the legalese...patching kernel text is definitely the kind of thing where one needs to CYA).

% cat > patchfreq
#!/bin/ksh

#
# File:     patchfreq
# Author:   Bryan Cantrill (bmc@eng.sun.com), Solaris Performance
# Modified: Sat Apr 26 04:00:59 PDT 1997
#
# This is a little script to patch a 5.5 or 5.5.1 kernel to get around
# the cpu_tick_freq inaccuracy. Before running this script, one must
# know the true frequency of one's CPU; this can be derived by NTP,
# or by observing the clock relative to the time-of-day chip over a
# long period of time (the TOD will pull system time when it drifts
# by more than two seconds).
#
# Patching a kernel can render a machine unbootable; do not run this
# script unless you are prepared to accept that possibility. It
# is advisable to have a backout path (e.g. net booting, an alternate
# boot disk, an installation CD) should your machine fail to boot.
#
# This is not a product of Sun Microsystems, and is provided "as is",
# without warranty of any kind expressed or implied including, but not
# limited to, the suitability of this script for any purpose.
#

if [ $# -eq 0 ]; then
        echo "Usage: $0 cpu_tick_freq [ alternate_kernel ]"
        exit 1
fi

cpu_tick_freq=$1
kernel=/platform/sun4u/kernel/unix

if [ $# -eq 2 ]; then
        kernel=$2
fi

if [ ! -w $kernel ]; then
        echo "$0: Cannot open $kernel for writing."
        exit 1
fi

arch=`echo utsname+404?s | adb $kernel | cut -d: -f2`

if [ ! $arch = "sun4u" ]; then
        echo "Patch only applies to sun4u"
        exit 1
fi

rel=`echo utsname+202?s | adb $kernel | cut -d: -f2`

if [ ! $rel = "5.5" ] && [ ! $rel = "5.5.1" ]; then
        echo "Patch only applies to 5.5 or 5.5.1..."
        exit 1
fi

nop="1000000"           # nop
store_mask="ffffe000"   # mask out low 13 bits
store="da256000"        # st %o5, [%l5 + offset]

instr=`echo setcpudelay+34?X | adb $kernel | cut -d: -f 2 | nawk '{ print $1 }'`

if [ $instr = $nop ]; then
        echo "Instruction already patched..."
else
        let masked="(16#$store_mask & 16#$instr) - 16#$store"
        if [ $masked -ne 0 ]; then
                echo "Couldn't find instruction to patch; aborting."
                exit 1
        fi

        if ! echo setcpudelay+34?W $nop | adb -w $kernel 1> /dev/null
        then
                echo "adb returned an unexpected error; aborting."
                exit 1
        fi
fi

echo "Patching cpu_tick_freq to $cpu_tick_freq..."

if ! echo cpu_tick_freq?W 0t$cpu_tick_freq | adb -w $kernel 1> /dev/null; then
        echo "adb returned an unexpected error; aborting."
        exit 1
fi

echo "$kernel successfully patched."
exit 0
^D
%

Let me know if it doesn't work for you (if it doesn't work, hopefully it will be by failing to patch and not by turning your machine into a warm brick)...

- Bryan

---------------------------------------------------------------------- Bryan Cantrill, Solaris Performance. bmc@eng.sun.com (415) 786-3652

From: moshier@mediaone.net () Date: 12 Mar 1998 02:50:57 GMT Newsgroups: comp.protocols.time.ntp Subject: Local ATOM? X-Keywords: rubidium

As there was a request for them, I've posted the electrical schematics for the "rubidium PC" phase lock frequency synthesizer to this web location -- http://people.ne.mediaone.net/moshier

From: Marc Brett <mbrett@rgs0.london.waii.com> Date: 11 Mar 1998 14:36:45 GMT Newsgroups: comp.protocols.time.ntp,comp.unix.aix Subject: Re: Setting up xntp for a few AIX-machines? X-Keywords: adjustment AIX

> Klaus Kusche wrote in message <350664A0.F55723D5@ooe.gv.at>...
> >* A lot of "tickadj" fiddling is required to allow smooth clock
> >adjustment: I never got it working without tickadj, half of
> >my machines work fine with the default "tickadj -A" value, others
> >required several attempts until clock adjustment worked
> >(for example, I have a RS/6000-F40 which has a hardware clock off by
> >about 400 ppm, and after trying several values, I finally got it working
> >using "tickadj -a 200" - both lower and higher values failed).

The version of xntpd which ships with AIX is adequate for most uses. If you want the latest udel version, I'd recommend 3-5.92a or later. Modern versions up to 3-5.92 had the wrong value of PRESET_TICKADJ which caused instability in many RS6K platforms with less-than-perfect clocks.

-- Marc Brett +44 181 560 3160 Western Geophysical Marc.Brett@waii.com 455 London Road, Isleworth FAX: +44 181 847 5711 Middlesex TW7 5AB UK

From: Gregory Bond <gnb@itga.com.au> Date: 12 Mar 1998 12:38:28 +1100 Newsgroups: comp.protocols.time.ntp Subject: Ross SparcPlug fix (was Re: Nasty fast clock (Sol 2.5) - can it be tamed???)

As a followup to the followup, I said:

> We have one of our Solaris 2.5.1 machines (a Ross SparcPlug cpu
> module) that has a shocking clock that runs about 1% (yes, that's
> 10,000ppm) fast.

It turns out this is some sort of well-known problem with the Ross SparcPlugs and there is a fix (which involves soldering irons and changing chips) that will be performed on this machine in the next few weeks.

-- Gregory Bond ITG Australia Ltd, Melbourne, Australia <mailto:gnb@itga.com.au> <http://www.bby.com.au/~gnb> ~From: bruce@itga.com.au (Do not use this address. It catches junk email.) ~From: bruce@bby.com.au, bruce@melba.bby.com.au (So do these ones)

From: wiu09524@rrzc4 (Ulrich Windl) Date: 13 Mar 1998 07:06:10 GMT Newsgroups: comp.protocols.time.ntp Subject: Re: looking for _Current_ Primary Server list

In article <3507ED3E.75132BBA@bwa.net> Bret Watson <Bret.Watson@bwa.net> writes:

> I've noticed that several of the listings in the Primary Server list at
> David's page are now invalid. Is there a current list in existence?

Probably the better approach is to contact the people having obsolete entries in the list. Ask them to get their entries updated or removed.

Ulrich

From: Sven Dietrich <sven@terrapin.csc.ncsu.edu> Date: Wed, 11 Mar 1998 19:49:51 -0800 Newsgroups: comp.protocols.time.ntp Subject: Server offsets: WWV vs. GPS X-Keywords: ATOM delay PPS WWV WWVB

There is also the problem that WWV clocks need to know the radio propagation delay from the source, i.e. the beeline distance to the tower; this can add up to milliseconds over hundreds of miles...

GPS doesn't have this problem!
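As a rough illustration of the size of that delay, here is a minimal Python sketch. It assumes straight-line propagation at the vacuum speed of light; a real shortwave path (for example one reflected off the ionosphere) is longer, so this is a lower bound. The function name `propagation_delay_ms` is made up for this example.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # vacuum speed of light


def propagation_delay_ms(distance_km):
    """One-way radio delay in milliseconds over a straight-line path."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000.0


# A receiver 1000 km from the transmitter sees a few milliseconds of delay
# that the clock driver has to be told about.
print(round(propagation_delay_ms(1000), 2))
```

The delay grows by roughly 3.3 ms per 1000 km, so an uncorrected receiver a long way from the transmitter is off by a fixed few milliseconds.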

For example, there is an NTP network in North Carolina which is perpetually 20 - 30 milliseconds off. I know because I ran their time against a very precise GPS-generated PPS interrupt, on a VERY FAST network (3 ms or less).

I actually hacked my interrupt to work with ATOM, and suddenly some of those lost clocks switched to the GPS, but that code broke when I started playing with 4.x and I haven't fixed it yet, so now they're back to listening to that silly WWV clock...

S.

Arnaud Girsch wrote:

> Hiya,
>
> Followup on my own post ....
>
> Thanks for the answers I got, and especially to Bruce Bartram ! :-)
>
> I didn't apply any patch or anything yet .... I was probably just a little bit
> too impatient.
>
> I sampled how fast the system was going, and calculated an initial value to put
> in my drift file.
> With that value (instead of 0, or empty), the host apparently converged to be
> close enough to the sources overnight, and stopped losing synchronization.
> Now, with a drift of 214, it seems to stay pretty close, at about .001 s
>
> Now ... I've heard that several times in various places .... but is it "normal"
> that sources based on GPS and WWVB are so far from each other .... I mean, I see
> an offset of .010 or .015 between stratum 1 servers.
>
> Since my upstreams take from GPS or WWVB, it sometimes creates problems jumping
> from one to another ...
>
> Arnaud.
>
> --
> Arnaud C. Girsch -+- The OASys Group, Inc. - A Cabletron Subsidiary
> agirsch@OASysGroup.com -+- Tel: 408-872-0203 Fax: 408-872-0210 - Saratoga, CA

From: Marc Brett <mbrett@rgs0.london.waii.com> Date: 16 Mar 1998 16:59:45 GMT Newsgroups: comp.protocols.time.ntp,comp.unix.aix Subject: Re: Problem setting tickadj on AIX 4.1 and XNTP 3.5.9X X-Keywords: AIX configuration

In comp.protocols.time.ntp Richard Siggins <rsiggins@eastman.com> wrote:
> I had some problems getting XNTP 3.5.92 to run on a specific AIX 4.1
> system. Dropping back to version 3.5.90 fixed the core dump problem but
> now I seem to have an unstable clock situation. It appears that xntpd is
> picking up an invalid tickadj value of 5. Other similar systems (AIX 4.2,
> AIX 3.2.5) have a value of 1000 for tickadj. Running the tickadj utility
> changes the value in the kernel but xntpd does not pick it up. Similarly,
> setting tickadj in the configuration file does not appear to have an
> effect.

> Has anyone else had this problem? If so, how do you override the tickadj
> value?

> --
> Richard Siggins
> Eastman Chemical Co.
> rsiggins@eastman.com

After you run ./configure, but before running make, replace the relevant line in config.h with:

#define PRESET_TICKADJ 1000

This should cure the drift on all AIX platforms. Versions 3-5.92a and beyond have this patch already, but as you note they have problems of their own...

Our experience is that an executable built on AIX 3.2.5 will run equally well on AIX 4.1.x and 4.2.x.

-- Marc Brett +44 181 560 3160 Western Geophysical Marc.Brett@waii.com 455 London Road, Isleworth FAX: +44 181 847 5711 Middlesex TW7 5AB UK

From: Klaus Kusche <Klaus.Kusche@ooe.gv.at> Date: Tue, 17 Mar 1998 08:19:17 +0100 Newsgroups: comp.protocols.time.ntp,comp.unix.aix Subject: Re: Problem setting tickadj on AIX 4.1 and XNTP 3.5.9X X-Keywords: AIX configuration

Richard Siggins wrote:
> I had some problems getting XNTP 3.5.92 to run on a specific AIX 4.1
> system. Dropping back to version 3.5.90 fixed the core dump problem but
> now I seem to have an unstable clock situation. It appears that xntpd is
> picking up an invalid tickadj value of 5. Other similar systems (AIX 4.2,
> AIX 3.2.5) have a value of 1000 for tickadj. Running the tickadj utility
> changes the value in the kernel but xntpd does not pick it up. Similarly,
> setting tickadj in the configuration file does not appear to have an
> effect.
>
> Has anyone else had this problem? If so, how do you override the tickadj
> value?

Same problem here: xntpd observes neither the value in the config file nor the actual kernel value (to the developers: If tickadj is able to read the correct kernel value, why can't xntpd do so, too?).

As far as I can tell, you have to set it via configure:
--enable-tickadj=40 (or whatever you have set with tickadj)
You might also experiment with --disable-accurate-adjtime

These settings seem to be cached by configure, so it is better to run "make distclean" if you want configure to obey any new settings.

From: Klaus Kusche <Klaus.Kusche@ooe.gv.at> Date: Fri, 13 Mar 1998 07:54:28 +0100 Newsgroups: comp.protocols.time.ntp Subject: Re: >32K pages used by xntpd v5.92c on AIX 4.1.4 X-Keywords: AIX

Greg Nowicki wrote:
> I downloaded xntp3-5.92c, ran the configure script, and made the
> executables without incident on AIX 4.1.4. When I started up xntpd,
> there was a few second pause when the machine was unresponsive.
> I noticed later that ps reported the process was taking up over 32K
> pages.

See my (and several others') posts in this group during the last three weeks: Use ulimit!

From: dalton@cup.hp.deletethis.com (David Dalton) Date: 17 Mar 1998 21:14:22 GMT Newsgroups: comp.protocols.time.ntp Subject: Re: Whats wrong here? X-Keywords: adjustment broadcast dispersion driftfile peer stability update

Benjumea Mondejar Jaime (benjumea@teclix.fie.us.es) wrote:
:>Hello, I'm trying to run xntp3-5.92-export.tar.gz in Solaris and Linux
:>but although my time client connects with the server, it doesn't
:>update the system clock. I've revised the xntpd man page but I
:>don't know what's wrong.

:>Here's my config file:

:>server 150.214.141.1
:>broadcast 150.214.7.255
:>driftfile /etc/ntp.drift

... snip ...

:>> peer 150.214.141.1 event 'event_reach' (0x84) status 'reach, conf, 1
:>> event, event_reach' (0x9014)
:>> clock_update(150.214.141.1)
:>> clock_select()

:>Using snoop, I find that my client connects to the server and
:>then the server answers but the clock remains untouched.

How long did you wait for the adjustment to happen? The absolute minimum is three minutes, and there is no realistic maximum. It all depends on the health of your servers and network connections. Run "ntpq -p" on the client and evaluate the dispersion. Run it repeatedly, 65 seconds apart. Does the dispersion improve? How large is the offset? Does the asterisk appear when the server is selected?

Remember that NTP is a sampling system, and it acquires data and evaluates it for sanity, health, stability, reliability, etc. Nothing happens instantaneously. If you want an instantaneous clock update, use "ntpdate". Many people can get by just running "ntpdate" from a cronjob. It is much easier to set up than the full-blown daemon process.

-- -> My $.02 only Not an official statement from HP {They make me say that} -- As far as we know, our computer has never had an undetected error. --------------------------------------------------------------------------- David Dalton dalton@cup.hp.deletethis.com 408/447-3016

From: "Richard Siggins" <rsiggins@eastman.com> Date: 17 Mar 1998 18:32:58 GMT Newsgroups: comp.protocols.time.ntp,comp.unix.aix Subject: Re: Trouble with xntpd on AIX 4.1.5 - help needed! X-Keywords: AIX configuration

Klaus You have helped me on my xntpd problems on AIX (tickadj value incorrect). THANKS!

I have a similar configuration to yours, with one exception. We are syncing to a GPS satellite clock from an AIX 4.2 system running the IBM xntpd daemon. The 4.2 system is my stratum 2 server for all the other AIX clients (4.2, 4.1.4, 3.2.5) plus some non-AIX clients. This configuration seems to be pretty stable. I've found the xntpd code that comes with AIX 4.2 to be pretty good. Maybe upgrading to 4.2 is an option?

-- Richard Siggins Eastman Chemical Co. rsiggins@eastman.com

Klaus Kusche <Klaus.Kusche@ooe.gv.at> wrote in article
>
> Now, I have added a radio clock receiver to the master.
> The xntpd on the master sync's to that receiver nicely,
> but it is unable to adjust the local clock correctly.
>
> * xntpd steps the local clock in regular intervals
> (every 20 minutes or so).
>

From: Klaus Kusche <Klaus.Kusche@ooe.gv.at> Date: Wed, 18 Mar 1998 09:10:17 +0100 Newsgroups: comp.protocols.time.ntp,comp.unix.aix Subject: Re: Trouble with xntpd on AIX 4.1.5 - help needed! X-Keywords: AIX bug configuration

Richard Siggins wrote:
>
> Klaus
> You have helped me on my xntpd problems on AIX (tickadj value incorrect).
> THANKS!
>
> I have a similar configuration to yours, with one exception. We are
> syncing to a GPS satellite clock from an AIX 4.2 system running the IBM
> xntpd daemon. The 4.2 system is my stratum 2 server for all the other AIX
> clients (4.2, 4.1.4, 3.2.5) plus some non-AIX clients. This configuration
> seems to be pretty stable. I've found the xntpd code that comes with AIX
> 4.2 to be pretty good. Maybe upgrading to 4.2 is an option?

I think I've got xntp to work well here now (after fixing a bug in 3-5.92c, see my article on comp.protocols.time.ntp yesterday, and using a tickadj of 1000 and --disable-accurate-adjtime).

AIX 4.2 is not an option here (we will wait until 4.3 stabilizes).

Interestingly, there is an "xntpd" coming with AIX 4.1.5, too, at least if the latest fileset levels are installed. However, there are several things I don't like about it:
* It is a rather old version.
* There are even man pages for it, but they say "for AIX 4.2.1 only".
* The IBM AIX hotline also told me "not supported under AIX 4.1.5, although it's there". Unofficially, I heard that under AIX 4.1.5, IBM's xntpd doesn't work any smoother than a self-compiled one.
* According to the man pages, it supports only one reference clock, namely the local hardware clock - useless for me!

From: dalton@cup.hp.deletethis.com (David Dalton) Date: 17 Mar 1998 20:56:29 GMT Newsgroups: comp.os.linux.networking,comp.protocols.time.ntp Subject: Re: Is jumpstarting Xntpd with ntpdate necesarry Yes/No X-Keywords: adjustment broadcast delay dispersion firewall maxpoll minpoll multicast peer poll PPS precision prefer synchronized syslog WWVB

jbessels (j.bessels@g-bank.nl) wrote:

:>I've just installed xntp3-5.92 on my Red Hat 5.0 Linux system (2.0.32)
:>on a HP Vectra 5/200 MMX PC. I've got a working xntpd server running on
:>the firewall (I'm the FW admin.) which synchronized OK with my ISP. For
:>testing purposes I've deliberately set my clock 5 minutes earlier to see
:>if xntpd would correct it. Nothing seems to happen. I've changed the gap
:>to only 1 minute, but still nothing happens. I've also waited about
:>10-15 minutes to give xntpd "some time".

:>Ntpdate OTOH works fine and changes the time according to the timeserver
:>on the firewall. /etc/rc.d/rc3.d/S83ntpd start starts the xntp daemon
:>correctly. I can check this because when running I can't run the ntpdate
:>program (right according to the man pages). Can someone explain what's
:>going on? And how can I check if xntpd works fine after I have run
:>ntpdate (at reboot time).

The secret to debugging xntpd problems is to run "ntpq -p" repeatedly. This will tell you what the daemon is thinking, whether it is about to make an adjustment, whether it has made any contact with the server(s), whether the servers are healthy, etc.

Your xntpd daemon cannot be considered "healthy" or "in sync" until the dispersion drops well below 1000 (milliseconds) and the asterisk appears in column one. How long does this take? It depends on your network service quality more than anything else. There is no hard and fast rule for "enough time". Usually 15 minutes is enough if the network is not congested, but that probably is not the case for you. Run "ntpq -p" repeatedly.

Here is sample output from "ntpq" running on my own cluster server in Cupertino. NTP had been running for less than 24 hours, and stabilized very nicely. This machine has Ethernet connections across the Cupertino site, and the routers have some sort of high-speed connection to other HP sites. Notice that some of the other machines being polled are 5000km away, but with offset less than 5 milliseconds and dispersion very low as well (even though delay is sometimes many tens of milliseconds).

ntpq> peers
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l  124   64  377     0.00   -0.234    2.01
 relay.hp.com    listo.hp.com     2 u  875 1024  377    13.84    4.912    4.88
 cosl4.cup.hp.co listo.hp.com     2 u  876 1024  377     4.38   -4.468    3.95
 paloalto.cns.hp listo.hp.com     2 u  885 1024  377     5.84    0.762    2.18
 chelmsford.cns. listo.hp.com     2 u  883 1024  377    89.45    2.160   11.40
 atlanta.cns.hp. listo.hp.com     2 u  881 1024  377    63.20   -2.545    0.99
 colorado.cns.hp listo.hp.com     2 u  883 1024  377    38.71   -1.110    2.01
 boise.cns.hp.co listo.hp.com     2 u  875 1024  377    32.88   -2.015    2.23

Suspect NTP Network
-------------------

Here is sample output from "ntpq" running on a suspect system. This machine has X25 connection (19200 baud) to the NTP server (which has a GPS clock), but I'm not sure of the distance in km. The "delay", "offset" and "disp" numbers are all very high. I am especially worried about the dispersion.

ntpq> peers
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
 big_srv         17.8.5.7         2 u    3  512   17   312.87  -249.15 1960.85

Now let's go over the meaning of each of the column headings and the measurements that go with them. Keep in mind that the most important columns are the last two, "offset" and "dispersion". "dispersion" reveals the quality of the network service (which the time service depends on very heavily).

remote          SERVER NAME
------

This is the name of the NTP server. It is usually another UNIX machine (could be HP, DEC, SUN, anything), but could also be an external reference clock like GPS or WWVB radio clock or even a modem.

The character in the left margin indicates the fate of this peer in the clock selection process. The codes mean:

"*"      selected for synchronization
"#"      selected for synchronization but distance exceeds maximum
"o"      selected for synchronization, PPS signal in use
"+"      included in the final selection set
"x"      designated falseticker by the intersection algorithm
"."      culled from the end of the candidate list
"-"      discarded by the clustering algorithm
"blank"  discarded due to high stratum and/or failed sanity checks

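The code table above maps directly onto a lookup. The following Python sketch is only an illustration: the names `TALLY_CODES` and `peer_fate` are made up here, and it assumes the raw "ntpq -p" line still has its column-one character (a space for "blank").

```python
# Column-one codes from the list above, keyed by the character itself.
TALLY_CODES = {
    "*": "selected for synchronization",
    "#": "selected for synchronization but distance exceeds maximum",
    "o": "selected for synchronization, PPS signal in use",
    "+": "included in the final selection set",
    "x": "designated falseticker by the intersection algorithm",
    ".": "culled from the end of the candidate list",
    "-": "discarded by the clustering algorithm",
    " ": "discarded due to high stratum and/or failed sanity checks",
}


def peer_fate(line):
    """Describe the fate of the peer on one raw "ntpq -p" output line."""
    code = line[0] if line else " "
    return TALLY_CODES.get(code, "unrecognized tally code")


print(peer_fate("*WWVB_SPEC(1)    .WWVB.           0 l"))
print(peer_fate(" relay.hp.com    listo.hp.com     2 u"))
```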
refid           REFERENCE IDENTIFICATION
-----

Usually the IP address of the server or the name of the external clock, but can also be a router between the client and server. Not important for our purposes.

st              STRATUM
--

This is a measure of distance to the true source of time. The HPGPS clock is stratum=0, the NTP daemon attached to the GPS clock is stratum=1, and big_srv (one more step away) is considered stratum=2 by all of its clients.

t               TYPE
-

The possible types are:

l       local (such as a GPS clock)
u       unicast (this is the most common type)
m       multicast
b       broadcast
-       netaddr (usually 0)

when
----

How long ago (in seconds) was the last response from this server? Not very important.

poll            POLL PERIOD
----

How often (in seconds) are we making a query to this server? 512 seconds (approx 8 minutes) and 1024 seconds (approx 17 minutes) are very popular for network connections, but a machine with an external clock (like GPS) should poll it every 64 seconds or less.

This number can be specified with the "minpoll" and "maxpoll" directives, but it is better to let the daemon adjust it as needed. After stabilizing at startup this number will move automatically to 1024 for network servers and 64 for external reference clocks.
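The poll values shown by ntpq are powers of two. The sketch below assumes the usual convention that "minpoll" and "maxpoll" are given as the exponent rather than the number of seconds (so 6 means 64 seconds and 10 means 1024 seconds); the helper names are made up for this example.

```python
def poll_seconds(exponent):
    """Convert a minpoll/maxpoll exponent to a poll period in seconds."""
    return 2 ** exponent


def poll_exponent(seconds):
    """Convert a poll period in seconds (a power of two) to its exponent."""
    if seconds <= 0 or seconds & (seconds - 1):
        raise ValueError("poll period must be a positive power of two")
    return seconds.bit_length() - 1


print(poll_seconds(6), poll_seconds(10))  # the 64 s and 1024 s figures above
print(poll_exponent(512))                 # exponent for the 512 s period
```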

reach           REACHABILITY            larger is better
-----

How successful are we in reaching the server?

This is an 8-bit shift register, displayed in octal, with the most recent probe in the 2^0 position. Thus 001 indicates that only the most recent probe was answered, 357 indicates one probe was not answered, and 377 indicates all of the recent probes have been answered.
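To make the register easier to read, here is a small Python sketch that expands a "reach" value into one flag per probe. It assumes the value is the octal string printed by ntpq; the function name `decode_reach` is made up for this example.

```python
def decode_reach(reach_octal):
    """Expand an octal "reach" value into eight booleans, newest probe first."""
    value = int(reach_octal, 8) & 0xFF
    # Bit 0 (the 2^0 position) is the most recent probe.
    return [bool((value >> bit) & 1) for bit in range(8)]


for reach in ("001", "357", "377", "17"):
    flags = decode_reach(reach)
    print(reach, "".join("Y" if answered else "n" for answered in flags))
```

Read this way, the value 17 shown for big_srv in the suspect example means that only the four most recent probes were answered.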

delay           ROUND TRIP TIME         smaller is better
-----

How long (in milliseconds) did it take for the reply packet to come back when we sent a query to the server?

offset          TIME DIFFERENCE         smaller is better
------

How far apart (in milliseconds) are the server's clock and the client's clock? This is the principal measure that the customer is interested in. When this number exceeds 128 then NTP makes a big adjustment (and the message "synchronization lost" appears in the logfile).

disp            DISPERSION              smaller is better
----

How much does the "offset" measurement vary between samples? How repeatable is the "delay" measurement? This is an error bound estimate. It is based on:

precision
delay/2
age of measurement / 86400

When this number exceeds 100 (milliseconds) it is very difficult for the daemon to keep the clock synchronized.

This "dispersion" number is a primary measure of network service quality. A slow X25 network not only has a sizeable round trip time, but the round trip time varies a lot from one query to the next. This is very bad for timekeeping purposes, because it makes the "offset" very hard to calculate. The real job of NTP is to manage the "offset" value and minimize it.
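The rules of thumb in this post (dispersion well below 1000 ms before the daemon is in sync, trouble above 100 ms, a big adjustment when the offset exceeds 128 ms) can be collected into a quick check. This is a rough sketch of those thresholds only, not of what the daemon itself computes; the function name `assess_peer` is made up for this example.

```python
def assess_peer(offset_ms, disp_ms):
    """Return warnings for one peer, using the thresholds quoted above."""
    notes = []
    if disp_ms >= 1000:
        notes.append("not in sync yet: dispersion is not below 1000 ms")
    elif disp_ms > 100:
        notes.append("dispersion above 100 ms: hard to stay synchronized")
    if abs(offset_ms) > 128:
        notes.append("offset above 128 ms: expect a big adjustment")
    return notes


print(assess_peer(-0.234, 2.01))      # the healthy WWVB clock
print(assess_peer(-249.15, 1960.85))  # big_srv from the suspect system
```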

=============================================================================
Below is some ntpq data from my own NTP test machine, beginning at daemon startup and continuing for about 24 hours. The first several samples were taken less than 5 minutes apart. You could gather some excellent data like this from your machine with a cronjob that runs every 5 or 10 minutes and executes this command:

ntpq -p >> /tmp/ntpq_data
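A log gathered this way can be post-processed with a few lines of Python. The sketch below assumes the classic ten-column "ntpq -p" layout shown in this post and that each line keeps its column-one selection character; the function name `parse_peer_line` and the dictionary keys are made up for this example.

```python
def parse_peer_line(line):
    """Parse one peer row of "ntpq -p" output.

    Returns a dict, or None for headers, separators and blank lines.
    """
    if not line.strip():
        return None
    tally, fields = line[0], line[1:].split()
    if len(fields) != 10:
        return None
    remote, refid, st, typ, when, poll, reach, delay, offset, disp = fields
    try:
        return {
            "tally": tally,
            "remote": remote,
            "refid": refid,
            "stratum": int(st),
            "type": typ,
            "when": when,             # kept as text
            "poll_s": int(poll),
            "reach": int(reach, 8),   # printed in octal
            "delay_ms": float(delay),
            "offset_ms": float(offset),
            "disp_ms": float(disp),
        }
    except ValueError:
        # The column header also has ten fields but is not numeric.
        return None


row = parse_peer_line(
    "*WWVB_SPEC(1)    .WWVB.           0 l  124   64  377     0.00   -0.234    2.01"
)
print(row["remote"], row["offset_ms"], row["disp_ms"])
```

Feeding every line of /tmp/ntpq_data through this function and keeping the rows whose "tally" is "*" gives a simple history of the selected server's offset and dispersion.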

You can see that the dispersion starts out artificially high and then quickly drops as real data is accumulated. On the third measurement the daemon has declared the radio clock to be stable and repeatable enough to be the selected server, and the * appears next to the WWVB_SPEC name. This is also when the syslog message "synchronized to 15.13.108.1" appears. The radio clock is selected because I use the "prefer" directive in /etc/ntp.conf.

Notice that the "poll" figures adjust automatically, rising from 64 to 1024 seconds for the network time servers. The radio clock "poll" stays at 64 seconds because the cost of polling the dedicated hardware is very low.

ntpq> peers
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
 WWVB_SPEC(1)    .WWVB.           0 l  100   64    3     0.00    7.828 7886.32
 relay.hp.com    listo.hp.com     2 u    3   64    7     9.77    7.507 3890.03
 cosl4.cup.hp.co listo.hp.com     2 u    3   64    7     3.48   16.229 3884.57
 paloalto.cns.hp listo.hp.com     2 u    3   64    7     5.08   21.023 3883.29
 chelmsford.cns. listo.hp.com     2 u    3   64    7    87.81   20.565 3882.32
 atlanta.cns.hp. listo.hp.com     2 u    3   64    7    63.78   19.660 3896.33
 colorado.cns.hp listo.hp.com     2 u    3   64    7    41.53   19.945 3882.54
 boise.cns.hp.co listo.hp.com     2 u    3   64    7    34.52   12.610 3885.73

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
 WWVB_SPEC(1)    .WWVB.           0 l  109   64    3     0.00    7.828 7886.32
 relay.hp.com    listo.hp.com     2 u   12   64    7     9.77    7.507 3890.03
 cosl4.cup.hp.co listo.hp.com     2 u   12   64    7     3.48   16.229 3884.57
 paloalto.cns.hp listo.hp.com     2 u   12   64    7     5.08   21.023 3883.29
 chelmsford.cns. listo.hp.com     2 u   12   64    7    87.81   20.565 3882.32
 atlanta.cns.hp. listo.hp.com     2 u   12   64    7    63.78   19.660 3896.33
 colorado.cns.hp listo.hp.com     2 u   12   64    7    41.53   19.945 3882.54
 boise.cns.hp.co listo.hp.com     2 u   12   64    7    34.52   12.610 3885.73

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l   65   64  377     0.00    4.115   20.22
 relay.hp.com    listo.hp.com     2 u   32   64  377    12.53   10.927    8.62
 cosl4.cup.hp.co listo.hp.com     2 u   32   64  377     3.43    3.377    7.13
 paloalto.cns.hp listo.hp.com     2 u   32   64  377     5.34    7.733    7.68
 chelmsford.cns. listo.hp.com     2 u   32   64  377    85.97    7.086   13.73
 atlanta.cns.hp. listo.hp.com     2 u   32   64  377    63.83    6.719    7.87
 colorado.cns.hp listo.hp.com     2 u   32   64  377    41.18    6.390    9.02
 boise.cns.hp.co listo.hp.com     2 u   32   64  377    34.53    2.931   15.61

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l  114   64  377     0.00   37.623   12.77
 relay.hp.com    listo.hp.com     2 u  225  512  377     6.93   34.052   10.79
 cosl4.cup.hp.co listo.hp.com     2 u  226  512  377     4.18   29.385   13.21
 paloalto.cns.hp listo.hp.com     2 u  235  512  377     9.80   33.487   11.61
 chelmsford.cns. listo.hp.com     2 u  233  512  377    88.79   30.462    9.66
 atlanta.cns.hp. listo.hp.com     2 u  231  512  377    67.44   32.909   12.86
 colorado.cns.hp listo.hp.com     2 u  233  512  377    43.70   30.077   18.63
 boise.cns.hp.co listo.hp.com     2 u  224  512  377    33.42   31.682    8.54

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l   96   64  377     0.00   41.279    4.29
 relay.hp.com    listo.hp.com     2 u  207 1024  377    38.88   56.319   27.47
 cosl4.cup.hp.co listo.hp.com     2 u  208 1024  377     6.36   35.910   13.03
 paloalto.cns.hp listo.hp.com     2 u  217 1024  377     5.80   40.161   12.37
 chelmsford.cns. listo.hp.com     2 u  215 1024  377    90.36   38.449   12.68
 atlanta.cns.hp. listo.hp.com     2 u  213 1024  377    64.09   37.787   11.25
 colorado.cns.hp listo.hp.com     2 u  215 1024  377    44.07   38.568   17.72
 boise.cns.hp.co listo.hp.com     2 u  206 1024  377    67.37   54.712   26.61

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l   87   64  377     0.00   39.299    3.69
 relay.hp.com    listo.hp.com     2 u    6 1024  377    65.48   67.331   24.52
 cosl4.cup.hp.co listo.hp.com     2 u    7 1024  377     4.32   35.311    6.41
 paloalto.cns.hp listo.hp.com     2 u   16 1024  377    80.37    1.781   32.01
 chelmsford.cns. listo.hp.com     2 u   14 1024  377    88.87   34.785    6.35
 atlanta.cns.hp. listo.hp.com     2 u   12 1024  377    63.16   36.973    5.52
 colorado.cns.hp listo.hp.com     2 u   14 1024  377    42.08   37.187    8.74
 boise.cns.hp.co listo.hp.com     2 u    5 1024  377    36.00   38.077   13.23

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l  118   64  377     0.00   29.347    3.07
 relay.hp.com    listo.hp.com     2 u  869 1024  377    65.48   67.331   24.52
 cosl4.cup.hp.co listo.hp.com     2 u  870 1024  377     4.32   35.311    6.41
 paloalto.cns.hp listo.hp.com     2 u  879 1024  377    80.37    1.781   32.01
 chelmsford.cns. listo.hp.com     2 u  877 1024  377    88.87   34.785    6.35
 atlanta.cns.hp. listo.hp.com     2 u  875 1024  377    63.16   36.973    5.52
 colorado.cns.hp listo.hp.com     2 u  877 1024  377    42.08   37.187    8.74
 boise.cns.hp.co listo.hp.com     2 u  868 1024  377    36.00   38.077   13.23

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l  122   64  377     0.00   23.105    2.35
 relay.hp.com    listo.hp.com     2 u  425 1024  377     7.80   25.924   29.69
 cosl4.cup.hp.co listo.hp.com     2 u  426 1024  377     4.96   23.267   10.71
 paloalto.cns.hp listo.hp.com     2 u  435 1024  377     8.15   25.546   16.92
 chelmsford.cns. listo.hp.com     2 u  433 1024  377    93.98   25.996    8.64
 atlanta.cns.hp. listo.hp.com     2 u  431 1024  377    64.48   24.259   11.31
 colorado.cns.hp listo.hp.com     2 u  433 1024  377    42.33   24.952   11.75
 boise.cns.hp.co listo.hp.com     2 u  424 1024  377    51.56   28.568   12.30

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l  109   64  377     0.00   17.206    3.51
 relay.hp.com    listo.hp.com     2 u  988 1024  377     7.80   25.924   29.69
 cosl4.cup.hp.co listo.hp.com     2 u  989 1024  377     4.96   23.267   10.71
 paloalto.cns.hp listo.hp.com     2 u  998 1024  377     8.15   25.546   16.92
 chelmsford.cns. listo.hp.com     2 u  996 1024  377    93.98   25.996    8.64
 atlanta.cns.hp. listo.hp.com     2 u  994 1024  377    64.48   24.259   11.31
 colorado.cns.hp listo.hp.com     2 u  996 1024  377    42.33   24.952   11.75
 boise.cns.hp.co listo.hp.com     2 u  987 1024  377    51.56   28.568   12.30

THE NEXT DAY
============
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l   77   64  377     0.00   -0.428    1.94
 relay.hp.com    listo.hp.com     2 u  764 1024  377    17.11   -6.332    5.20
 cosl4.cup.hp.co listo.hp.com     2 u  765 1024  377     5.08   -4.378    0.49
 paloalto.cns.hp listo.hp.com     2 u  774 1024  377     7.80   -2.625    2.29
 chelmsford.cns. listo.hp.com     2 u  777 1024  377    91.67   -2.011    5.66
 atlanta.cns.hp. listo.hp.com     2 u  775 1024  377    64.39   -2.457   87.01
 colorado.cns.hp listo.hp.com     2 u  783 1024  377    43.12   -0.855    3.22
 boise.cns.hp.co listo.hp.com     2 u  774 1024  377    42.60    1.471    5.58

THE NEXT DAY
============
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*WWVB_SPEC(1)    .WWVB.           0 l   99   64  377     0.00    0.137    1.95
 relay.hp.com    listo.hp.com     2 u  658 1024  377    16.05    3.133   11.35
 cosl4.cup.hp.co listo.hp.com     2 u  659 1024  377     4.10   -4.988    0.72
 paloalto.cns.hp listo.hp.com     2 u  668 1024  377     5.87    6.015    0.84
 chelmsford.cns. listo.hp.com     2 u  666 1024  377    88.30    1.554    3.52
 atlanta.cns.hp. listo.hp.com     2 u  664 1024  377    70.31   -3.034    3.75
 colorado.cns.hp listo.hp.com     2 u  666 1024  377    40.82   -1.843    5.81
 boise.cns.hp.co listo.hp.com     2 u  657 1024  377    41.24    0.141    9.90
