Focalnet

“Focal Modulation Networks”1

  • TODO link to my focalnet post

Vision Transformers - “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”2

  • TODO link to my vision transformer post

Relationship between learning rate and batch size3

“Deep Neural Networks with Multi-Branch Architectures Are Intrinsically Less Non-Convex” 4

Multi-branch architectures are usually easier to optimize, at the expense of inference latency. The MobileOne paper addresses this by training with the typical multi-branch architecture, then collapsing the branches into an equivalent single-branch layer for inference.
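To make the collapse concrete, here is a minimal numpy sketch of the identity it relies on (my own toy example, not the MobileOne code): when every branch is linear in the same input, the sum of the branch outputs equals a single layer whose weight is the sum of the branch weights, plus the identity for a skip connection.

```python
import numpy as np

# Toy example: three parallel linear branches over the same input
# (a dense branch, a "cheap" branch, and an identity skip connection).
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)

W_dense = rng.normal(size=(d, d))
W_cheap = rng.normal(size=(d, d)) * 0.1
I = np.eye(d)

# Multi-branch forward pass: the train-time structure.
y_multi = W_dense @ x + W_cheap @ x + x

# Collapse: because every branch is linear in x, an equivalent single layer exists.
W_fused = W_dense + W_cheap + I
y_single = W_fused @ x

assert np.allclose(y_multi, y_single)  # identical outputs, one branch at inference
```

As I understand it, the actual reparameterization also folds batchnorm statistics into the conv weights, but the underlying identity is the same.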

“Traditional and Heavy-Tailed Self Regularization in Neural Network Models” - 5

Uses random matrix theory to explain why deeper or wider networks have certain generalization properties, based on the spectral characteristics of their weight matrices. The spectral characteristics of a layer depend on the SVD of its weight matrix, which tells you which signals the layer is capable of amplifying.

“Singular values represent the strengths of the signals that the matrices can capture or amplify. The singular values helps us understand how information flows through the network and how sensitive the network is to various patterns in the data.” -chatGPT

“Heavy-Tailed Distributions: A key observation in the paper is that the singular value distributions of weight matrices in trained neural networks often exhibit heavy-tailed properties. Heavy-tailed distributions are characterized by a higher probability of extreme values compared to normal (Gaussian) distributions. This implies that the weight matrices have a significant number of very large singular values, suggesting that the network can capture a wide range of features, from very common to very rare ones.” -chatGPT

Basically, the singular values of trained networks are larger than they would be if the weights were Gaussian. We usually initialize to something that looks Gaussian, so the fact that training makes the weights look different indicates there is some property of the trained weights that is useful or desirable.
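A quick way to eyeball this (my own sketch, not the paper's code): compare the singular value spectrum of a Gaussian matrix at init scale against one with a few strong low-rank directions added, which is a crude stand-in for what training does. A real check would load a layer's weights from a checkpoint instead.

```python
import numpy as np

def singular_spectrum(W):
    """Singular values sorted in descending order."""
    return np.linalg.svd(W, compute_uv=False)

rng = np.random.default_rng(0)
m, n = 512, 512

# Gaussian matrix: roughly what the weights look like at initialization.
W_init = rng.normal(scale=1 / np.sqrt(n), size=(m, n))

# Stand-in for a trained matrix: the same Gaussian bulk plus a few strong
# low-rank directions (the "signals" training amplifies).
spikes = sum(5.0 * np.outer(rng.normal(size=m), rng.normal(size=n)) / np.sqrt(m * n)
             for _ in range(3))
W_trained_like = W_init + spikes

print("init top-5:   ", singular_spectrum(W_init)[:5])
print("trained top-5:", singular_spectrum(W_trained_like)[:5])
# The spiked matrix shows a handful of singular values well above the
# Marchenko-Pastur bulk edge (~2 for this scaling); heavy-tailed spectra
# in real trained networks look qualitatively like this.
```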

General topics of interest

  • Prioritized Training
  • Active learning, knowledge distillation, student-teacher learning
  • Knowledge graphs
  • Mind maps

  • semantic search
  • efficient, sub-1B specialized data retrieval LLMs that are called inside some other workflow
  • …nervously, agents
  • self-supervised training for video
  • AI-aided developer tools
  • multimodal pretraining
  • tokenization, including in vision models

Types of architecture / training improvements

  • numerical stability - batchnorm, gradient clipping, Adam and gradient preconditioning
  • expressivity
  • data efficiency - using model loss discrepancies to tell you something about what is learned and what can be learned
  • discouraging neuron conspiracy - dropout


LLM self-aware character distribution

If you ask an LLM to output text that is meaningful but has a uniform probability distribution over all characters, can it do it? Does the model know the actual letters in a word? There are probably tokens just for the letters themselves, so it could use those as an intermediate step and somewhat keep track.

  • test this by asking for random text and for random text with the character distribution constraint, then count the characters and plot the distributions. Repeat for robustness.

  • get a list of apis you want to test
  • sample each api 100 times
  • for each sample, count the number of characters
    • store in a nested dictionary?
    • {'claude': {1: {'a': 203, 'b': 405, …, 'z': …}}}
  • make two histograms and use Wasserstein distance to measure how much of a difference the character-aware prompt makes (see the sketch after this list)
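A rough sketch of the analysis step, assuming the ~100 samples per API and prompt are already collected as strings; scipy's wasserstein_distance does the comparison, and the API name and sample texts below are placeholders.

```python
from collections import Counter
import string

import numpy as np
from scipy.stats import wasserstein_distance

def char_distribution(texts):
    """Normalized a-z frequency vector over a list of sampled outputs."""
    counts = Counter()
    for t in texts:
        counts.update(c for c in t.lower() if c in string.ascii_lowercase)
    total = sum(counts.values()) or 1
    return np.array([counts[c] / total for c in string.ascii_lowercase])

# Placeholder samples; in the real experiment these would be ~100 API
# responses each for prompt A (unconstrained) and prompt B (character-aware).
samples = {
    "claude": {
        "prompt_a": ["The quick brown fox jumps over the lazy dog."],
        "prompt_b": ["Jovial zebras quickly vexed the grumpy wombat, fixing dinner."],
    },
}

letters = np.arange(26)  # treat a..z as positions on a line for Wasserstein
uniform = np.full(26, 1 / 26)
for api, by_prompt in samples.items():
    p_a = char_distribution(by_prompt["prompt_a"])
    p_b = char_distribution(by_prompt["prompt_b"])
    # How far each prompt's character distribution is from the uniform target.
    d_a = wasserstein_distance(letters, letters, u_weights=p_a, v_weights=uniform)
    d_b = wasserstein_distance(letters, letters, u_weights=p_b, v_weights=uniform)
    print(f"{api}: prompt A -> {d_a:.4f}, prompt B -> {d_b:.4f}")
```

Comparing each prompt against the uniform target (rather than A against B directly) makes it easier to see whether the constraint actually pulled the output toward uniform.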

Sample prompt A

Write out 1000 words, whatever you want to write about. For me, this will be random. There are no constraints.

Sample prompt B

Write out 1000 words, whatever you want to write about. For me, this will be random. There is only one constraint: I want you to try to write sentences that have as uniform a distribution over the characters as possible. The ideal output for 260 characters would have 10 a’s, 10 b’s, … and 10 z’s.

I will be analyzing the distribution of characters in your output, so you have full freedom to express your underlying distribution.

Or does this end up falling to Zipf’s law? What if, instead of doing it with characters, you do it with a tokenizer? You’d need to use the same tokenizer that each respective model uses, right?
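For the token-level variant, a possible sketch using tiktoken for the OpenAI-style tokenizers (other model families would need their own tokenizer, which is exactly the catch above):

```python
from collections import Counter

import tiktoken  # OpenAI tokenizer; other model families ship their own

def token_distribution(text, encoding_name="cl100k_base"):
    """Token-id -> relative frequency for one sampled output."""
    enc = tiktoken.get_encoding(encoding_name)
    ids = enc.encode(text)
    total = len(ids) or 1
    return {tok: count / total for tok, count in Counter(ids).items()}

# Usage: run this over the same samples as the character analysis, then compare
# the resulting distribution against a uniform one over the observed tokens.
print(sorted(token_distribution("Sample model output.").items(),
             key=lambda kv: -kv[1])[:5])
```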


Dirichlet distribution (multivariate Beta distribution)

Beta distribution - “In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli, binomial, negative binomial, and geometric distributions.” - Wikipedia
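Conjugacy just means the posterior stays in the same family: a Beta(α, β) prior on a Bernoulli parameter updated with h successes and t failures gives a Beta(α + h, β + t) posterior. A small scipy sketch with made-up numbers:

```python
from scipy.stats import beta

# Beta(2, 2) prior on a coin's heads probability p (mildly centered at 0.5).
alpha_prior, beta_prior = 2.0, 2.0

# Observed Bernoulli data: 7 heads, 3 tails.
heads, tails = 7, 3

# Conjugate update: posterior is Beta(alpha + heads, beta + tails).
posterior = beta(alpha_prior + heads, beta_prior + tails)
print("posterior mean:", posterior.mean())            # (2+7)/(2+7+2+3) ≈ 0.643
print("95% credible interval:", posterior.interval(0.95))
```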


Bayesian hyperparameter optimization

  1. https://arxiv.org/abs/2203.11926 

  2. https://arxiv.org/abs/2010.11929 

  3. https://arxiv.org/abs/1711.00489 

  4. https://proceedings.mlr.press/v89/zhang19d/zhang19d.pdf 

  5. https://arxiv.org/abs/1901.08276