Skip to content

Handling of Unicode Byte Order Marks

This section documents the finer points of Guile’s handling of Unicode byte order marks (BOMs). A byte order mark (U+FEFF) is typically found at the start of a UTF-16 or UTF-32 stream, to allow readers to reliably determine the byte order. Occasionally, a BOM is found at the start of a UTF-8 stream, but this is much less common and not generally recommended.

Guile attempts to handle BOMs automatically, and in accordance with the recommendations of the Unicode Standard, when the port encoding is set to UTF-8, UTF-16, or UTF-32. In brief, Guile automatically writes a BOM at the start of a UTF-16 or UTF-32 stream, and automatically consumes one from the start of a UTF-8, UTF-16, or UTF-32 stream.

As specified in the Unicode Standard, a BOM is only handled specially at the start of a stream, and only if the port encoding is set to UTF-8, UTF-16 or UTF-32. If the port encoding is set to UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE, then BOMs are not handled specially, and none of the special handling described in this section applies.

  • To ensure that Guile will properly detect the byte order of a UTF-16 or UTF-32 stream, you must perform a textual read before any writes, seeks, or binary I/O. Guile will not attempt to read a BOM unless a read is explicitly requested at the start of the stream.
  • If a textual write is performed before the first read, then an arbitrary byte order will be chosen. Currently, big endian is the default on all platforms, but that may change in the future. If you wish to explicitly control the byte order of an output stream, set the port encoding to UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE, and explicitly write a BOM (#\xFEFF) if desired.
  • If set-port-encoding! is called in the middle of a stream, Guile treats this as a new logical “start of stream” for purposes of BOM handling, and will forget about any BOMs that had previously been seen. Therefore, it may choose a different byte order than had been used previously. This is intended to support multiple logical text streams embedded within a larger binary stream.
  • Binary I/O operations are not guaranteed to update Guile’s notion of whether the port is at the “start of the stream”, nor are they guaranteed to produce or consume BOMs.
  • For ports that support seeking (e.g. normal files), the input and output streams are considered linked: if the user reads first, then a BOM will be consumed (if appropriate), but later writes will not produce a BOM. Similarly, if the user writes first, then later reads will not consume a BOM.
  • For ports that are not random access (e.g. pipes, sockets, and terminals), the input and output streams are considered independent for purposes of BOM handling: the first read will consume a BOM (if appropriate), and the first write will also produce a BOM (if appropriate). However, the input and output streams will always use the same byte order.
  • Seeks to the beginning of a file will set the “start of stream” flags. Therefore, a subsequent textual read or write will consume or produce a BOM. However, unlike set-port-encoding!, if a byte order had already been chosen for the port, it will remain in effect after a seek, and cannot be changed by the presence of a BOM. Seeks anywhere other than the beginning of a file clear the “start of stream” flags.