NAME
norminit, normpull, runecomp, runedecomp, runegbreak, runewbreak, utfcomp, utfdecomp, utfgbreak, utfwbreak - multi-rune graphemes

SYNOPSIS
#include <u.h>
#include <libc.h>

typedef struct Norm Norm;
struct Norm {
...      /* internals */
};
void     norminit(Norm *n, int comp, void *ctx, long (*getrune)(void *ctx));
long     normpull(Norm *n, Rune *dst, long max, int flush);
long     runecomp(Rune *dst, long ndst, Rune *src, long nsrc);
long     runedecomp(Rune *dst, long ndst, Rune *src, long nsrc);
long     utfcomp(char *dst, long ndst, char *src, long nsrc);
long     utfdecomp(char *dst, long ndst, char *src, long nsrc);
Rune*    runegbreak(Rune *s);
Rune*    runewbreak(Rune *s);
char*    utfgbreak(char *s);
char*    utfwbreak(char *s);

DESCRIPTION
These routines handle Unicode® abstract characters that span more than one codepoint. Normalization rewrites equivalent codepoint sequences into one consistent representation. This is useful when a protocol requires normalized text, or when a program needs to compare input semantically regardless of how it was encoded.

The Norm structure holds the state for all normalization routines. Norminit initializes the structure. If the comp argument is non-zero, the output is normalized to NFC (precomposed runes); otherwise it is normalized to NFD (decomposed runes). The getrune argument supplies the input: each call returns the next rune, or -1 on EOF. The ctx argument is stored and passed to getrune on every call.

Normpull produces the normalized output, writing at most max runes into dst and returning the number of runes written. To normalize correctly, the Norm structure must buffer input until it knows that the context for a given base rune is complete. To accommodate callers that have only a chunk of data at a time, the structure keeps runes in its buffer even when getrune returns EOF. A non-zero flush argument changes this behavior: when getrune returns EOF, normpull flushes every rune remaining in the buffer. Normpull does not null-terminate the output, but null bytes are passed through untouched, so null-terminated input yields null-terminated output.
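As a sketch of the getrune contract described above, the callback below hands out runes from an in-memory chunk and reports EOF with -1 once the chunk is exhausted. The Chunk type and the name chunkget are hypothetical. The Rune typedef is a stand-in so that the fragment compiles outside Plan 9; on 9front it comes from <u.h>.

```c
typedef unsigned int Rune;	/* stand-in for the <u.h> typedef */

/* hypothetical context: one chunk of input held in memory */
typedef struct Chunk Chunk;
struct Chunk {
	Rune *p;	/* next rune to hand out */
	Rune *e;	/* one past the last rune of the chunk */
};

/*
 * Has the shape long (*getrune)(void *ctx):
 * returns the next rune, or -1 when the chunk is used up.
 */
long
chunkget(void *ctx)
{
	Chunk *c;

	c = ctx;
	if(c->p >= c->e)
		return -1;
	return *c->p++;
}
```

A caller would pass this as norminit(&n, 1, &c, chunkget), point c at each new chunk as it arrives, and call normpull with flush set to zero. On the final chunk a non-zero flush drains the runes still held in the structure's buffer.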

Runecomp, runedecomp, utfcomp, and utfdecomp are wrappers around the Norm structure, designed to normalize fixed-size input in one call. In all four, src and dst are the source and destination strings, and nsrc and ndst give the number of elements in each. The functions never read more than nsrc elements and never write more than ndst elements. If the output buffer is too small, the result is truncated. The return value is the number of elements written to dst. Like normpull, these functions do not null-terminate the output, and pass null bytes through untouched.
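Because the one-shot functions do not terminate their output, a caller that wants a C string has to hold back one byte and write the NUL itself. The wrapper below is a minimal sketch of that pattern. The names Utfnorm, normstr, and utfcopy are hypothetical: Utfnorm only restates the signature that utfcomp and utfdecomp share, and utfcopy is a byte-copying stand-in for trying the wrapper outside Plan 9. On 9front the two headers from the SYNOPSIS take the place of <string.h>.

```c
#include <string.h>

/* signature shared by utfcomp and utfdecomp in the SYNOPSIS */
typedef long Utfnorm(char *dst, long ndst, char *src, long nsrc);

/*
 * Normalize the NUL-terminated src into dst and terminate the result.
 * One byte of dst is held back for the NUL.  Returns the number of
 * bytes written, not counting the NUL, or -1 if dst has no room or
 * norm reports a count outside the space it was given.
 */
long
normstr(Utfnorm *norm, char *dst, long ndst, char *src)
{
	long n;

	if(ndst <= 0)
		return -1;
	n = norm(dst, ndst-1, src, (long)strlen(src));
	if(n < 0 || n >= ndst)
		return -1;
	dst[n] = '\0';
	return n;
}

/* stand-in with the same signature: copies bytes, truncating to ndst */
long
utfcopy(char *dst, long ndst, char *src, long nsrc)
{
	long n;

	n = nsrc < ndst ? nsrc : ndst;
	memcpy(dst, src, n);
	return n;
}
```

On 9front the call would read normstr(utfcomp, buf, sizeof buf, s). The return value alone does not say whether the result was truncated, so a caller that cares should size dst generously.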

The normalization standard does not limit the number of decomposed attaching runes that may follow a base rune. To normalize within a bounded amount of memory, these functions implement the Stream-Safe Text subset of normalization, in which a base rune may have no more than 30 attaching runes. When the input contains a run of more than 30 attaching runes, these functions insert the Combining Grapheme Joiner (U+034F) to provide a new base for the remaining combining runes.
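The 30-rune limit can be illustrated with a short sketch. It is not the code in runenorm.c: the attaching predicate is a deliberate simplification that only recognizes the Combining Diacritical Marks block, where a real implementation would consult the Unicode character data. The names streamsafe, attaching, and Maxattach are hypothetical, and the Rune typedef is a stand-in for the one in <u.h>.

```c
typedef unsigned int Rune;	/* stand-in for the <u.h> typedef */

enum {
	CGJ = 0x034F,		/* Combining Grapheme Joiner */
	Maxattach = 30,		/* attaching runes allowed per base */
};

/* simplification: only U+0300..U+036F, minus CGJ, counts as attaching */
static int
attaching(Rune r)
{
	return r >= 0x0300 && r <= 0x036F && r != CGJ;
}

/*
 * Copy src to dst, inserting CGJ before any attaching rune that
 * would be the 31st in its run.  Output is truncated at ndst.
 * Returns the number of runes written.
 */
long
streamsafe(Rune *dst, long ndst, Rune *src, long nsrc)
{
	long i, n, run;

	n = 0;
	run = 0;
	for(i = 0; i < nsrc; i++){
		if(!attaching(src[i]))
			run = 0;
		else if(++run > Maxattach){
			if(n >= ndst)
				break;
			dst[n++] = CGJ;	/* new base for what follows */
			run = 1;
		}
		if(n >= ndst)
			break;
		dst[n++] = src[i];
	}
	return n;
}
```

With a base rune followed by 31 copies of U+0301, the sketch writes the base, 30 marks, a CGJ, and then the final mark.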

Runegbreak returns the next grapheme break opportunity in s, or s if none is found; runewbreak does the same for word breaks. Utfgbreak and utfwbreak are the UTF variants of these routines.
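A typical use of the break routines is to walk a string one segment at a time. The loop below is a sketch under two assumptions that the text above does not spell out: that s is NUL-terminated, and that the returned pointer marks the start of the next segment. The names Breaker, countsegs, and everyrune are hypothetical. Breaker restates the signature of runegbreak and runewbreak, and everyrune is a stand-in that breaks after every rune so the loop can be tried outside Plan 9.

```c
typedef unsigned int Rune;	/* stand-in for the <u.h> typedef */

/* signature shared by runegbreak and runewbreak */
typedef Rune *Breaker(Rune *s);

/* count the segments of the NUL-terminated s as delimited by brk */
long
countsegs(Rune *s, Breaker *brk)
{
	Rune *p, *q;
	long n;

	n = 0;
	for(p = s; *p != 0; p = q){
		n++;
		q = brk(p);
		if(q == p)
			break;	/* no further break: the rest is one segment */
	}
	return n;
}

/* stand-in breaker: a break opportunity after every rune */
Rune*
everyrune(Rune *s)
{
	return *s != 0 ? s+1 : s;
}
```

On 9front, countsegs(s, runegbreak) would count graphemes and countsegs(s, runewbreak) would count words, if the assumptions above hold.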

SOURCE
/sys/src/libc/ucd/mkrunetype.c
/sys/src/libc/ucd/runenorm.c
/sys/src/libc/ucd/runebreak.c

SEE ALSO
Unicode® Standard Annex #15
Unicode® Standard Annex #29
rune(2), utf(6), tcs(1)

HISTORY
This implementation was first written for 9front (March, 2023). The implementation was rewritten (in part) for Unicode 16.0 (March, 2025).