[Chapter 12] 12.3 Reading Debugging Output

12.3 Reading Debugging Output

We'll cover five examples of debugging output. The first example shows the name server starting up. The next two examples show successful name lookups. The fourth example shows a secondary name server keeping its zone up to date. In the last example, we switch from showing you name server behavior to showing you resolver behavior: the resolver search algorithm. After each trace, except the last one, we killed the name server and started it again so that each trace started with a fresh, nearly empty cache.

You might wonder why we've chosen to show normal name server behavior for all our examples; after all, this chapter is about debugging. We are showing you normal behavior because you have to know what normal operation is before you track down abnormal operation. Another reason is to help you understand the concepts (retransmissions, round trip times, etc.) we have described in earlier chapters.

12.3.1 Name Server Startup (Debug Level 1)

We'll start the debugging examples by watching the name server initialize. We used -d 1 on the command line, and this is the named.run output that resulted:

1)  Debug level 1
2)  Version = named 8.1.1 Sat Jul 19 08:06:36 EDT 1997
3)      pma@terminator:/home/pma/named
4)  conffile = named.conf
5)  starting.  named 8.1.1 Sat Jul 19 08:06:36 EDT 1997
6)      pma@terminator:/home/pma/named
7)  ns_init(named.conf)
8)  update_zone_info('0.0.127.IN-ADDR.ARPA', 1)
9)  source = db.127.0.0
10) purge_zone(0.0.127.IN-ADDR.ARPA,1)
11) reloading zone
12) db_load(db.127.0.0, 0.0.127.IN-ADDR.ARPA, 1, Nil)
13) np_parent(0x0) couldn't find root entry
14) master zone "0.0.127.IN-ADDR.ARPA" (IN) loaded (serial 1)
15) zone[1] type 1: '0.0.127.IN-ADDR.ARPA' z_time 0, z_refresh 0
16) update_zone_info('.', 3)
17) source = db.cache
18) reloading zone
19) db_load(db.cache, , 0, Nil)
20) cache zone "" (IN) loaded (serial 0)
21) zone[0] type 3: '.' z_time 0, z_refresh 0
22) getnetconf(generation 887560796)
23) getnetconf: considering lo [127.0.0.1]
24) ifp->addr [127.0.0.1].53 d_dfd 20
25) evSelectFD(ctx 0x808f0e0, fd 20, mask 0x1, func 0x8056bf0,
    uap 0x80ac350)
26) evSelectFD(ctx 0x808f0e0, fd 21, mask 0x1, func 0x806fb08,
    uap 0x80ac398)
27) listening [127.0.0.1].53 (lo)
28) getnetconf: considering eth0 [192.249.249.3]
29) ifp->addr [192.249.249.3].53 d_dfd 22
30) evSelectFD(ctx 0x808f0e0, fd 22, mask 0x1, func 0x8056bf0,
    uap 0x80ac408)
31) evSelectFD(ctx 0x808f0e0, fd 23, mask 0x1, func 0x806fb08,
    uap 0x80ac450)
32) listening [192.249.249.3].53 (eth0)
33) fwd ds 5 addr [0.0.0.0].1142
34) Forwarding source address is [0.0.0.0].1142
35) evSelectFD(ctx 0x808f0e0, fd 5, mask 0x1, func 0x8056bf0, uap 0)
36) exit ns_init()
37) Ready to answer queries.
38) prime_cache: priming = 0
39) evSetTimer(ctx 0x808f0e0, func 0x8054cf4,
    uap 0, due 887560800.000000000, inter 0.000000000)
40) sysquery: send -> [192.5.5.241].53 dfd=5 nsid=41705 id=0 41)
    retry=887560800
41) evSetTimer(ctx 0x808f0e0, func 0x804ee88,
    uap 0x80a4a20, due 887560803.377717000, inter 0.000000000)
42) datagram from [192.5.5.241].53, fd 5, len 436
43) 13 root servers

We added the line numbers to the debugging output; you won't see them in yours. Lines 2 through 6 give the version of BIND you are running and the name of the configuration file. Version 8.1.1 was released by ISC (Internet Software Consortium) in 1997. We used the configuration file in the current directory, ./named.conf, for this run.

Lines 7 through 21 show BIND reading the configuration file and the db files. This name server is a caching-only name server - the only files read are db.127.0.0 (lines 8 through 15) and db.cache (lines 16-21). Line 8 lists the zone being updated (0.0.127.IN-ADDR.ARPA) and line 9 shows the file containing the zone data (db.127.0.0). Line 10 indicates that any old data for the zone is purged before new data is added. Line 11 says the zone is being reloaded, even though the zone is actually being loaded for the first time. The zone data is loaded during lines 12 through 14. Ignore the useless error message on line 13. On lines 15 and 21, z_time is the time to check when this zone is up to date; z_refresh is the zone refresh time. These values only matter when the server is a secondary server for the zone.

Lines 22 through 35 show the initialization of file descriptors. (In this case, they're really socket descriptors.) File descriptors 20 and 21 (lines 24-26) are bound to 127.0.0.1, the loopback address. Descriptor 20 is a datagram socket and descriptor 21 is a stream socket. File descriptors 22 and 23 (lines 29-31) are bound to the 192.249.249.3 interface. Each interface address was considered and used - they would not be used if the interface had not been initialized, or if the address were already in the list. File descriptor 5 (lines 33-35) is bound to 0.0.0.0, the wildcard address. Most network daemons use only one socket bound to the wildcard address, not sockets bound to individual interfaces. The wildcard address picks up packets sent to any interface on the host. Let's digress for a moment to explain why named uses both a socket bound to the wildcard address and sockets bound to specific interfaces.

When named receives a request from an application or from another name server, it will receive the request on one of the sockets bound to a specific interface. If named did not have sockets bound to specific interfaces, it would receive the requests on the socket bound to the wildcard address. When named sends back a response, it uses the same socket descriptor that the request came in on. Why does named do this? When responses are sent out via the socket bound to the wildcard address, the kernel fills in the sender's address with the address of the interface the response was actually sent out on. This address may or may not be the same address that the request was sent to. When responses are sent out via the socket bound to a specific address, the kernel fills in the sender's address with that specific address - the same address the request was sent to. If the name server gets a response from an IP address it didn't know about, the response is tagged a "martian" and discarded. named tries to avoid martian responses by sending its responses on descriptors bound to specific interfaces, so the sender's address is the same address the request was sent to. However, when named sends out queries, it uses the wildcard descriptor, since there is no need to force a specific IP address.

Lines 38 through 43 show the name server sending out a system query to find out which name servers are currently serving the root domain. This is known as "priming the cache." The first server queried sent a response that included 13 name servers.

The name server is now initialized, and it is ready to answer queries.

12.3.2 A Successful Lookup (Debug Level 1)

Suppose you want to watch the name server look up a name. Your name server wasn't started with debugging. Send a USR1 signal to turn on debugging, look up the name, then send a USR2 signal to turn off debugging, like this:

# kill -USR1 `cat /etc/named.pid`
# /etc/ping galt.cs.purdue.edu.
# kill -USR2 `cat /etc/named.pid`

We did this; here's the resulting named.run file:

datagram from [192.249.249.3].1162, fd 20, len 36

req: nlookup(galt.cs.purdue.edu) id 29574 type=1 class=1
req: missed 'galt.cs.purdue.edu' as '' (cname=0)
forw: forw -> [198.41.0.10].53 ds=4 nsid=40070 id=29574 2ms retry 4sec
datagram from [198.41.0.10].53, fd 4, len 343

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40070
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 9, ADDITIONAL: 9
;;              galt.cs.purdue.edu, type = A, class = IN
EDU.                        6D IN NS    A.ROOT-SERVERS.NET.
EDU.                        6D IN NS    H.ROOT-SERVERS.NET.
EDU.                        6D IN NS    B.ROOT-SERVERS.NET.
EDU.                        6D IN NS    C.ROOT-SERVERS.NET.
EDU.                        6D IN NS    D.ROOT-SERVERS.NET.
EDU.                        6D IN NS    E.ROOT-SERVERS.NET.
EDU.                        6D IN NS    I.ROOT-SERVERS.NET.
EDU.                        6D IN NS    F.ROOT-SERVERS.NET.
EDU.                        6D IN NS    G.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.             5w6d16h IN A    198.41.0.4
H.ROOT-SERVERS.NET.             5w6d16h IN A    128.63.2.53
B.ROOT-SERVERS.NET.             5w6d16h IN A    128.9.0.107
C.ROOT-SERVERS.NET.             5w6d16h IN A    192.33.4.12
D.ROOT-SERVERS.NET.             5w6d16h IN A    128.8.10.90
E.ROOT-SERVERS.NET.             5w6d16h IN A    192.203.230.10
I.ROOT-SERVERS.NET.             5w6d16h IN A    192.36.148.17
F.ROOT-SERVERS.NET.             5w6d16h IN A    192.5.5.241
G.ROOT-SERVERS.NET.             5w6d16h IN A    192.112.36.4
resp: nlookup(galt.cs.purdue.edu) qtype=1
resp: found 'galt.cs.purdue.edu' as 'edu' (cname=0)
resp: forw -> [192.36.148.17].53 ds=4 nsid=40071 id=29574 1ms
datagram from [192.36.148.17].53, fd 4, len 202

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40071
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 4
;;   galt.cs.purdue.edu, type = A, class = IN
PURDUE.EDU.                 2D IN NS    NS.PURDUE.EDU.
PURDUE.EDU.                 2D IN NS    MOE.RICE.EDU.
PURDUE.EDU.                 2D IN NS    PENDRAGON.CS.PURDUE.EDU.
PURDUE.EDU.                 2D IN NS    HARBOR.ECN.PURDUE.EDU.
NS.PURDUE.EDU.                  2D IN A     128.210.11.5
MOE.RICE.EDU.                   2D IN A     128.42.5.4
PENDRAGON.CS.PURDUE.EDU.              2D IN A  128.10.2.5
HARBOR.ECN.PURDUE.EDU.              2D IN A     128.46.199.76
resp: nlookup(galt.cs.purdue.edu) qtype=1
resp: found 'galt.cs.purdue.edu' as 'cs.purdue.edu' (cname=0)
resp: forw -> [128.46.199.76].53 ds=4 nsid=40072 id=29574 8ms
datagram from [128.46.199.76].53, fd 4, len 234

send_msg -> [192.249.249.3].1162 (UDP 20) id=29574
Debug off

First, notice that IP addresses are logged, not names - odd for a name server, don't you think? It's not that odd, though. If you are trying to debug a problem with looking up names, you don't want the name server looking up additional names just to make the debugging output more readable - the extra queries would interfere with the debugging. None of the debugging levels translate IP addresses into names. You'll have to use a tool (like the one we provide later) to convert them for you.

Let's go through this debugging output line by line. This detailed approach is important if you want to understand what each line means. If you turn on debugging, you're probably trying to find out why some name can't be looked up, and you're going to have to figure out what the trace means.

datagram from [192.249.249.3].1162, fd 20, len 36

A datagram came from the host with IP address 192.249.249.3 (terminator). You may see the datagram come from 127.0.0.1 if the sender is on the same host as the name server. The sending application used port 1162. The name server received the datagram on file descriptor (fd) 20. The startup debugging output, like the one shown earlier, will tell you which interface file descriptor 20 is bound to. The length (len) of the datagram was 36 bytes.

req: nlookup(galt.cs.purdue.edu) id 29574 type=1 class=1

Since the next debugging line starts with req, we know that the datagram was a request. The name looked up in the request was galt.cs.purdue.edu. The request id is 29574. The type=1 means the request is for address information. The class=1 means the class was IN. You will find a complete list of query types and classes in the header file /usr/include/arpa/nameser.h.

req: missed 'galt.cs.purdue.edu' as '' (cname=0)

The name server looked up the requested name and didn't find it. Then it tried to find a remote name server to ask; none was found until the root domain (the empty quotes). The cname=0 means the name server has not encountered a CNAME record. If it does see a CNAME record, the canonical name is looked up instead of the original name, and cname will be nonzero.

forw: forw -> [198.41.0.10].53 ds=4 nsid=40070 id=29574 2ms retry 4sec

The query was forwarded to the name server (port 53) on host 198.41.0.10 (j.root-servers.net). The name server used file descriptor 4 (which is the wildcard address) to send the query. The name server tagged this query with ID number 40070 (nsid=40070) so that it could match the response to the original question. The application used ID number 29574 (id=29574), as you saw on the nlookup line. The name server will wait 4 seconds before trying the next name server.

datagram from [198.41.0.10].53, fd 4, len 343

The name server on j.root-servers.net responded. Since the response was a delegation, it is printed in full in the debug log.

resp: nlookup(galt.cs.purdue.edu) qtype=1

After the information is cached in the response packet, the name is looked up again. As mentioned earlier, query type=1 means that the name server is looking for address information.

resp: found 'galt.cs.purdue.edu' as 'edu' (cname=0)
resp: forw -> [192.36.148.17].53 ds=4 nsid=40071 id=29574 1ms
datagram from [192.36.148.17].53, fd 4, len 202

The root server responded with a delegation to the edu servers. The same query is sent to 192.36.148.17 (i.root-servers.net), one of the edu servers. i.root-servers.net responds with information about the purdue.edu servers.

resp: found 'galt.cs.purdue.edu' as 'cs.purdue.edu' (cname=0)

This time there is some information at the cs.purdue.edu level.

resp: forw -> [128.46.199.76].53 ds=4 nsid=40072 id=29574 8ms

A query was sent to the name server on 128.46.199.76 (harbor.ecn.purdue.edu).This time the name server ID is 40072.

datagram from [128.46.199.76].53, fd 4, len 234

The name server on harbor.ecn.purdue.edu responded. We have to look at what happens next to figure out the contents of this response.

send_msg -> [192.249.249.3].1162 (UDP 20) id=29574

The last response must have contained the address requested, since the name server responded to the application (which used port 1162, if you look back at the original query). The response was in a UDP packet (as opposed to a TCP connection), and it used file descriptor 20.

This name server was "quiet" when we did this trace; it wasn't handling other queries at the same time. When you do a trace on an active name server you won't be so lucky. You'll have to sift through the output and patch together those pieces that pertain to the lookup in which you are interested. It's not that hard, though. Start up your favorite editor, search for the nlookup line with the name you looked up, then trace the entries with the same nsid. You'll see how to follow the nsid in the next trace.

12.3.3 A Successful Lookup with Retransmissions (Debug Level 1)

Not all lookups are as "clean" as the last one - sometimes the query must be retransmitted. The user doesn't see any difference so long as the lookup succeeds, although the query involving retransmissions will take longer. Following is a trace where there are retransmissions. We converted the IP addresses to names after the trace was done. Notice how much easier it is to read with names!

1)  Debug turned ON, Level 1
2)
3)  datagram from terminator.movie.edu port 3397, fd 20, len 35
4)  req: nlookup(ucunix.san.uc.edu) id 1 type=1 class=1
5)  req: found 'ucunix.san.uc.edu' as 'edu' (cname=0)
6)  forw: forw -> i.root-servers.net port 53  ds=4 nsid=2 id=1 0ms
    retry 4 sec
7)
8)  datagram from i.root-servers.net port 53, fd 4, len 240
    <delegation lines removed>
9)  resp: nlookup(ucunix.san.uc.edu) qtype=1
10) resp: found 'ucunix.san.uc.edu' as 'san.uc.edu' (cname=0)
11) resp: forw -> uceng.uc.edu port 53 ds=4 nsid=3 id=1 0ms
12) resend(addr=1 n=0) - > ucbeh.san.uc.edu port 53 ds=4 nsid=3
    id=1 0ms
13)
14) datagram from terminator.movie.edu port 3397, fd 20, len 35
15) req: nlookup(ucunix.san.uc.edu) id 1 type=1 class=1
16) req: found 'ucunix.san.uc.edu' as 'san.uc.edu' (cname=0)
17) resend(addr=2 n=0) - > uccba.uc.edu port 53 ds=4 nsid=3 id=1 0ms
18) resend(addr=3 n=0) - > mail.cis.ohio-state.edu port 53 ds=4 nsid=3
    id=1 0ms
19)
20) datagram from mail.cis.ohio-state.edu port 53, fd 4, len 51
21) send_msg -> terminator.movie.edu (UDP 20 3397) id=1

This trace starts out like the last trace (lines 1 through 11): the name server receives a query for ucunix.san.uc.edu, sends the query to an edu name server (i.root-servers.net), receives a response that includes a list of name servers for uc.edu, and sends the query to one of the uc.edu name servers (uceng.uc.edu).

What's new in this trace is the resend lines (lines 12, 17, and 18). The forw on line 11 counts as "resend(addr=0 n=0)" - CS dweebs always start counting with zero. Since uceng.uc.edu didn't respond, the name server went on to try ucbeh (line 12), uccba (line 17), and mail (line 18). The off-site name server on mail.cis.ohio-state.edu finally responded (line 20). Notice that you can track all of the retransmissions by searching for nsid=3; that's important to know because lots of other queries can be wedged between these.

Also, notice the second datagram from terminator (line 14). It has the same port, file descriptor, length, ID, and type as the query on line 3. The application didn't receive a response in time, so it retransmitted its original query. Since the name server is still working on the first query transmitted, this one is a duplicate. It doesn't say so in this output, but the name server detected the duplicate and dropped it. We can tell because there is no forw: line after the req: lines, as there was on lines 4 through 6.

Can you guess what this output might look like if the name server were having trouble looking up a name? You'd see a lot of retransmissions as the name server kept trying to look up the name (which you could track by matching the nsid= lines). You'd see the application send a couple more retransmissions, thinking that the name server hadn't received the application's first query. Eventually the name server would give up, usually after the application itself gave up.

12.3.4 A Slave Name Server Checking Its Zone (Debug Level 1)

In addition to tracking down problems with name server lookups, you may have to track down why a slave server is not loading from its master. Tracking down this problem can often be done by simply comparing the domain's SOA serial numbers on the two servers, using nslookup or dig, as we'll show in Chapter 13. If your problem is more elusive, you may have to resort to looking at the debugging information. We'll show you what the debugging information should look like if your server is running normally.

This debugging output was generated on a "quiet" name server - one not receiving any queries - to show you exactly which lines pertain to zone maintenance. If you remember, a slave name server uses a child process to transfer the zone data to the local disk before reading it in. While the slave logs its debugging information to named.run, the slave's child process logs its debugging information to xfer.ddt.PID. The PID suffix, by default the process ID of the child process, may be changed to ensure that the filename is unique. Beware - turning on debugging on a slave name server will leave xfer.ddt.PID files lying around, even if you are only trying to trace a lookup. Our trace is at debugging level 1 and we turned on the BIND 8 logging option print-time. Debug level 3 gives you more information, more than you may want if a transfer actually occurs. A debugging level 3 trace of a zone transfer of several hundred resource records can create an xfer.ddt.PID file several megabytes large:

21-Feb 00:13:18.026 do_zone_maint for zone movie.edu (class IN)
21-Feb 00:13:18.034 zone_maint('movie.edu')
21-Feb 00:13:18.035 qserial_query(movie.edu)
21-Feb 00:13:18.043 sysquery: send -> [192.249.249.3].53 dfd=5
                         nsid=29790 id=0 retry=888048802
21-Feb 00:13:18.046 qserial_query(movie.edu) QUEUED
21-Feb 00:13:18.052 next maintenance for zone 'movie.edu' in 2782 sec
21-Feb 00:13:18.056 datagram from [192.249.249.3].53, fd 5, len 380
21-Feb 00:13:18.059 qserial_answer(movie.edu, 26739)
21-Feb 00:13:18.060 qserial_answer: zone is out of date
21-Feb 00:13:18.061 startxfer() movie.edu
21-Feb 00:13:18.063 /usr/etc/named-xfer -z movie.edu -f db.movie
                    -s 26738 -C 1 -P 53 -d 1 -l xfer.ddt 192.249.249.3
21-Feb 00:13:18.131 started xfer child 390
21-Feb 00:13:18.132 next maintenance for zone 'movie.edu' in 7200 sec

21-Feb 00:14:02.089 endxfer: child 390 zone movie.edu returned
                         status=1 termsig=-1
21-Feb 00:14:02.094 loadxfer() "movie.edu"
21-Feb 00:14:02.094 purge_zone(movie.edu,1)

21-Feb 00:14:30.049 db_load(db.movie, movie.edu, 2, Nil)
21-Feb 00:14:30.058 next maintenance for zone 'movie.edu' in 1846 sec

21-Feb 00:17:12.478 slave zone "movie.edu" (IN) loaded (serial 26739)
21-Feb 00:17:12.486 no schedule change for zone 'movie.edu'

21-Feb 00:42:44.817 Cleaned cache of 0 RRs

21-Feb 00:45:16.046 do_zone_maint for zone movie.edu (class IN)
21-Feb 00:45:16.054 zone_maint('movie.edu')
21-Feb 00:45:16.055 qserial_query(movie.edu)
21-Feb 00:45:16.063 sysquery: send -> [192.249.249.3].53 dfd=5
                         nsid=29791 id=0 retry=888050660
21-Feb 00:45:16.066 qserial_query(movie.edu) QUEUED
21-Feb 00:45:16.067 next maintenance for zone 'movie.edu' in 3445 sec
21-Feb 00:45:16.074 datagram from [192.249.249.3].53, fd 5, len 380
21-Feb 00:45:16.077 qserial_answer(movie.edu, 26739)
21-Feb 00:45:16.078 qserial_answer: zone serial is still OK
21-Feb 00:45:16.131 next maintenance for zone 'movie.edu' in 2002 sec

Unlike the previous traces, each line in this trace has a timestamp. The timestamp makes clear which debug statements are grouped together.

This server is a slave for a single zone, movie.edu. The line with time 00:13:18.026 shows that it is time to check with the master server. The server queries for the zone's SOA record and compares serial numbers before deciding to load the zone. The lines with times 00:13:18.059 through 00:13:18.131 show you the zone's serial number (26739), tell you the zone is out of date, and start a child process (pid 390) to transfer the zone. At time 00:13:18.132, a timer is set to expire 7200 seconds later. This is the amount of time the server allows for a transfer to complete. At time 00:14:02.089, you see the exit status of the child process. The status of 1 indicates that the zone data was successfully transferred. The old zone data is purged (time 00:14:02.094), and the new data is loaded.

The next maintenance (see time 00:14:30.058) is scheduled for 1846 seconds later. For this zone, the refresh interval is 3600, but the name server chose to check again in 1846 seconds. Why? The name server is trying to avoid having the refresh periods become synchronized. Instead of using 3600 exactly, it uses a random time between half the refresh interval (1800) and the full refresh interval (3600). At 00:45:16.046, the zone is checked again and this time it is up-to-date.

If your trace ran long enough, you'd see more lines like the one at 00:42:44.817 - one line each hour. What's happening is that the server is making a pass through its cache, freeing any data that has expired, to reduce the amount of memory used.

The master server for this zone is a BIND 4 name server. If the master were a BIND 8 name server, the slave would be notified when a zone changed rather than waiting for the refresh interval to expire. The slave server's debug output would look almost exactly the same, but the trigger to check the zone status is a NOTIFY:

rcvd NOTIFY(movie.edu, IN, SOA) from [192.249.249.3].1059
qserial_query(movie.edu)
sysquery: send -> [192.249.249.3].53 dfd=5
          nsid=29790 id=0 retry=888048802


12.2 Turning On Debugging		12.4 The Resolver Search Algorithm and Negative Caching