The interesting BDH question: What if LLM memory lived in the network weights instead of the ever-growing KV cache?
r/singularity
I've seen BDH come up in a few discussion threads, but I couldn't find a compact explanation of what the architecture actually claims. I found Jan Chorowski's seminar and took notes, so I'm posting the short version here in case it saves others the full watch. I'm exploring post-transformer architectures, so treat this as my understanding of one architecture, not a definitive take, and please correct anything I got wrong. I