Learning Perl

Chapter 16
System Database Access

16.2 Packing and Unpacking Binary Data

The password and group information is nicely represented in textual form. Other system databases are more naturally represented in other forms. For example, the IP address of an interface is internally managed as a four-byte number. While it is frequently decoded into a textual representation consisting of four small integers separated by periods, this encoding and decoding is wasted effort if a human is not interpreting the data in the meantime.

Because of this, the network routines in Perl that expect or return an IP address use a four-byte string that contains one character for each sequential byte in memory. While constructing and interpreting such a byte string is fairly straightforward using chr and ord (not presented here), Perl provides a shortcut that is equally applicable to more difficult structures.
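For comparison, here is a minimal sketch of the chr and ord approach mentioned above. The variable names (@octets, $addr, @decoded) are ours, chosen for illustration, and the byte values are the same ones used in the pack examples in this section:

```perl
use strict;
use warnings;

# Build the four-byte string by hand: one chr() call per byte value.
my @octets = (140, 186, 65, 25);
my $addr   = join '', map { chr($_) } @octets;

# Decode it again: split into single characters and take ord() of each.
my @decoded = map { ord($_) } split //, $addr;

printf "%d bytes: %s\n", length($addr), join('.', @decoded);
# prints "4 bytes: 140.186.65.25"
```

This works, but every new structure needs its own hand-written loop, which is the tedium that pack and unpack remove.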

The pack function works a bit like sprintf, taking a format control string and a list of values, and creating a single string from those values. The pack format string is geared towards creating a binary data structure, however. For example, here's how to take four small integers and pack them as successive unsigned bytes in a composite string:

$buf = pack("CCCC", 140, 186, 65, 25);

Here, the pack format string is four C's. Each C represents a separate value taken from the following list (similar to what a % field does in sprintf). The C format (according to the Perl manpages, the reference card, Programming Perl, the HTML files, or even Perl: The Motion Picture) refers to a single byte computed from an unsigned character value (a small integer). The resulting string in $buf is a four-character string, each character being one byte from the four values 140, 186, 65, and 25.
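If you want to convince yourself of what ended up in $buf, something like the following sketch may help. It borrows unpack, which is described later in this section, and the H* format (a hex dump of the whole string), which this section does not otherwise cover:

```perl
use strict;
use warnings;

my $buf = pack("CCCC", 140, 186, 65, 25);

print length($buf), "\n";                       # 4
print join(".", unpack("CCCC", $buf)), "\n";    # 140.186.65.25
print unpack("H*", $buf), "\n";                 # 8cba4119
```

The hex dump shows the four bytes 0x8c, 0xba, 0x41, and 0x19 in the order in which the values were listed.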

Similarly, the format l generates a signed long value. In current versions of Perl, l is always exactly 32 bits (four bytes), and the machine's native long is requested with l! instead. The order of the bytes, however, is still machine-dependent. The statement

$buf = pack("l",0x41424344);

generates a four-character string that looks like ABCD on a big-endian machine or DCBA on a little-endian machine (or something entirely different if the machine doesn't speak ASCII). This happens because we are packing one value into four characters (the length of a long integer), and the one value just happens to be composed of the bytes representing the ASCII values for the first four letters of the alphabet. Similarly,

$buf = pack("ll", 0x41424344, 0x45464748);

creates an eight-byte string consisting of ABCDEFGH on a big-endian machine or DCBAHGFE on a little-endian machine.
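When the byte order matters, it can be pinned down rather than left to the machine. The sketch below compares the native l format with two formats that this section does not otherwise discuss: N (a 32-bit unsigned value in big-endian, or "network", order) and V (the same in little-endian, or "VAX", order). An ASCII machine is assumed:

```perl
use strict;
use warnings;

my $value  = 0x41424344;
my $native = pack("l", $value);   # byte order of this machine
my $big    = pack("N", $value);   # always big-endian
my $little = pack("V", $value);   # always little-endian

print "native: $native\n";        # ABCD or DCBA, depending on the machine
print "N:      $big\n";           # ABCD
print "V:      $little\n";        # DCBA
```

On most machines, the native result matches one of the other two, which tells you which kind of machine you have.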

The complete list of pack formats is given in the reference documentation (perlfunc(1), or Programming Perl). You'll see a few here as examples, but we're not going to list them all.

What if you were given the eight-byte string ABCDEFGH and were told that it was really the memory image (one character is one byte) of two long (four-byte) signed values? How would you interpret it? Well, you'd need to do the inverse of pack, called unpack. This function takes a format control string (usually identical to the one you'd give pack) and a data string, and returns a list of values that make up the memory image defined in the data string. For example, let's take that string apart:

($val1,$val2) = unpack("ll","ABCDEFGH");

This gives us back 0x41424344 for $val1 on a big-endian machine, or 0x44434241 on a little-endian machine. In fact, from the values that come back, we can determine whether we are on a little- or big-endian machine.
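That last observation can be turned into a small test. This is only a sketch; the $order variable and the wording of the messages are ours, and an ASCII machine is assumed:

```perl
use strict;
use warnings;

my ($val1, $val2) = unpack("ll", "ABCDEFGH");
printf "val1 = 0x%08x, val2 = 0x%08x\n", $val1, $val2;

# Compare the first value against the two byte orders we expect.
my $order = $val1 == 0x41424344 ? "big-endian"
          : $val1 == 0x44434241 ? "little-endian"
          :                       "something else";
print "This machine looks $order\n";
```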

Whitespace in the format control string is ignored, and can be used for readability. A number in the format control string generally repeats the previous specification that many times. For example, CCCC can also be written C4 or C2C2 with no change in meaning. (A few of the specifications use a trailing number as a part of the specification, and thus cannot be multiplied like that.)
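A quick way to check these equivalences is to pack the same list with each spelling and compare the results. The variable names in this sketch are ours:

```perl
use strict;
use warnings;

my @bytes = (140, 186, 65, 25);

my $plain   = pack("CCCC",    @bytes);
my $counted = pack("C4",      @bytes);
my $paired  = pack("C2C2",    @bytes);
my $spaced  = pack("C C C C", @bytes);   # whitespace is ignored

print "all four strings are identical\n"
    if $plain eq $counted and $plain eq $paired and $plain eq $spaced;
```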

A format character can also be followed by a *, which repeats the format character enough times to swallow up the rest of the list or the rest of the binary image string (depending on whether you are packing or unpacking). So, here's another way to pack four unsigned characters into a string:

$buf = pack("C*", 140, 186, 65, 25);

The four values here are swallowed up by the one format specification. If you want two short integers followed by "as many unsigned chars as possible," you can say something like this:

$buf = pack("s2 C*", 3141, 5926, 5, 3, 5, 8, 9, 7, 9, 3, 2);

Here, we take the first two values as shorts (generating four characters, because s is a 16-bit value in current versions of Perl) and the remaining nine values as unsigned characters (generating nine characters, almost certainly).
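The arithmetic is easy to verify. In this sketch, unpacking with the same format string recovers the original list; the names $s1, $s2, and @rest are ours:

```perl
use strict;
use warnings;

my $buf = pack("s2 C*", 3141, 5926, 5, 3, 5, 8, 9, 7, 9, 3, 2);
print length($buf), "\n";     # 13: two 2-byte shorts plus nine single bytes

# The same format string takes the string apart again.
my ($s1, $s2, @rest) = unpack("s2 C*", $buf);
print "$s1 $s2 @rest\n";      # 3141 5926 5 3 5 8 9 7 9 3 2
```

Note that the * has to come last in the format: once C* has swallowed the rest of the list, there is nothing left for any later specification.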

Going in the other direction, unpack with an asterisk specification can generate a list of elements of unpredetermined length. For example, unpacking with C* creates one list element (a number) for each string character. So, the statement

@values = unpack("C*", "hello, world!\n");

yields a list of 14 elements, one for each of the characters of the string.
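To see this for yourself, count the elements and look at the first few. The sketch below assumes an ASCII machine, and also shows that pack with C* reverses the operation; the $copy variable is ours:

```perl
use strict;
use warnings;

my @values = unpack("C*", "hello, world!\n");

print scalar(@values), "\n";     # 14
print "@values[0..4]\n";         # 104 101 108 108 111 (the codes for "hello")

# Packing the numbers with the same format rebuilds the original string.
my $copy = pack("C*", @values);
print $copy;                     # hello, world!
```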

