Understand Byte-Pair Encoding Tokenization in 1 min
1) For NLP tasks, we want to tokenize strings into integers drawn from a fixed-size vocabulary.
2) A Unicode code point (0 to 1,114,111) is an extension of standard ASCII (0-127) and covers non-English characters as well.
3) Why not just use the Unicode code point as the integer representation of each character? The vocabulary would be huge (all languages in one pool), hence very sparse in a high-dimensional space and inefficient. The Unicode standard is also subject to change.
4) A UTF encoding such as UTF-8 encodes each Unicode code point as a short byte sequence (1 to 4 bytes for UTF-8, depending on the character); a byte is 8 bits. Note that the resulting bytes are not literally the Unicode code point, but there is a mapping defined by the UTF encoding rule. Also note that strict English text needs only one byte per character, since UTF-8 is backward compatible with standard ASCII (0-127), whereas characters in extended ASCII (128-255) take two or more bytes in UTF-8. (See the first sketch below.)
5) But the raw byte sequence is still not good enough: it is long (each character can take 1 to 4 bytes under UTF-8), so it consumes too much of the context window in NLP tasks.
6) Byte-pair encoding (BPE) mitigates this issue by compressing the byte sequence: it iteratively replaces frequent adjacent pairs with a new symbol that gets a fresh integer id, e.g. the pair ('t': 116, 'h': 104) becomes a single token whose id is not yet in the id set, such as 256.
7) Applying this iteratively, each round replaces all occurrences of that round's most frequent pair with a new token, until the maximum number of rounds is reached; the number of rounds (merges) is a hyperparameter we can tune. As a result, token ids run from 0 to (255 + number of merge rounds), and we obtain a more compressed integer-list representation of the original text. (See the second sketch below.)
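
As a quick illustration of point 4, here is a minimal Python sketch (not from the original note) using the built-in str.encode to show how UTF-8 maps characters to 1-4 bytes; the example characters are arbitrary.

```python
# How UTF-8 maps characters to 1-4 bytes (byte values shown as integers 0-255).
for ch in ["t", "é", "中", "🙂"]:
    b = ch.encode("utf-8")
    print(ch, list(b), f"-> {len(b)} byte(s)")
# "t"  -> [116]                  1 byte  (ASCII range: byte value == code point)
# "é"  -> [195, 169]             2 bytes
# "中" -> [228, 184, 173]        3 bytes
# "🙂" -> [240, 159, 153, 130]   4 bytes
```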
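Points 6 and 7 describe the training loop. Below is a minimal sketch of that merge procedure; the function name bpe_train, the toy input string, and the choice of 5 merges are illustrative assumptions, not from the note.

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Minimal BPE sketch: start from UTF-8 bytes (ids 0-255) and
    repeatedly merge the most frequent adjacent pair into a new id."""
    ids = list(text.encode("utf-8"))        # initial token ids: raw bytes
    merges = {}                              # (id_a, id_b) -> new token id
    next_id = 256                            # first id after the 256 byte values
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))   # count adjacent pairs this round
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)     # most frequent pair this round
        merges[pair] = next_id
        # replace every occurrence of `pair` with the new id
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return ids, merges

tokens, merges = bpe_train("the theme of the thesis", num_merges=5)
print(tokens)   # compressed id sequence; ids >= 256 are merged pairs
print(merges)   # e.g. {(116, 104): 256, ...} -- the ('t', 'h') pair merged first
```

To encode new text with the learned vocabulary, the same merges are applied in the order they were learned; decoding simply expands each merged id back into its constituent byte pair recursively.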