Understand Byte-Pair Encoding Tokenization in 1 min
1) For NLP tasks, we want to tokenize strings into integers drawn from a fixed-size vocabulary.
2) A Unicode code point (0 to 1,114,111) is an extension of standard ASCII (0-127) and covers non-English characters as well.
3) Why not just use the Unicode code point as the integer representation of each character? The vocabulary would be huge (all languages in one pool), hence very sparse in a high-dimensional space and inefficient. The Unicode standard is also subject to change.
4) A UTF encoding such as UTF-8 encodes each Unicode code point as a short byte sequence (1 to 4 bytes for UTF-8, depending on the character); a byte is 8 bits. Note that the resulting bytes are not literally the Unicode code point, but there is a mapping defined by the UTF encoding rule. Also note that strict English text needs only one byte per character, since UTF-8 is backward compatible with standard ASCII (0-127), whereas characters in extended ASCII (128-255) take two or more bytes in UTF-8. (See the first sketch below.)
5) But the raw byte sequence is still not good enough: it is long (each character can take 1 to 4 bytes under UTF-8), so it consumes too much of the context window in NLP tasks.
6) Byte-pair encoding (BPE) mitigates this issue by compressing the byte sequence: it iteratively replaces frequent adjacent pairs with a new symbol that gets a fresh integer id, e.g. the pair ('t': 116, 'h': 104) becomes a single token whose id is not yet in the id set, such as 256.
7) Applying this iteratively, each round replaces all occurrences of that round's most frequent pair with a new token, until the maximum number of rounds is reached; the number of rounds (merges) is a hyperparameter we can tune. As a result, token ids run from 0 to (255 + number of merge rounds), and we obtain a more compressed integer-list representation of the original text. (See the second sketch below.)
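
As a quick illustration of point 4, here is a minimal Python sketch (not from the original note) using the built-in str.encode to show how UTF-8 maps characters to 1-4 bytes; the example characters are arbitrary.

```python
# How UTF-8 maps characters to 1-4 bytes (byte values shown as integers 0-255).
for ch in ["t", "é", "中", "🙂"]:
    b = ch.encode("utf-8")
    print(ch, list(b), f"-> {len(b)} byte(s)")
# "t"  -> [116]                  1 byte  (ASCII range: byte value == code point)
# "é"  -> [195, 169]             2 bytes
# "中" -> [228, 184, 173]        3 bytes
# "🙂" -> [240, 159, 153, 130]   4 bytes
```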
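Points 6 and 7 describe the training loop. Below is a minimal sketch of that merge procedure; the function name bpe_train, the toy input string, and the choice of 5 merges are illustrative assumptions, not from the note.

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Minimal BPE sketch: start from UTF-8 bytes (ids 0-255) and
    repeatedly merge the most frequent adjacent pair into a new id."""
    ids = list(text.encode("utf-8"))        # initial token ids: raw bytes
    merges = {}                              # (id_a, id_b) -> new token id
    next_id = 256                            # first id after the 256 byte values
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))   # count adjacent pairs this round
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)     # most frequent pair this round
        merges[pair] = next_id
        # replace every occurrence of `pair` with the new id
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

tokens, merges = bpe_train("the theme of the thesis", num_merges=5)
print(tokens)   # compressed id sequence; ids >= 256 are merged pairs
print(merges)   # e.g. {(116, 104): 256, ...} -- the ('t', 'h') pair merged first
```

To encode new text with the learned vocabulary, the same merges are applied in the order they were learned; decoding simply expands each merged id back into its constituent byte pair recursively.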