I'm open-sourcing my experimental custom NPU architecture designed for local AI acceleration

Hi all,

Like many of you, I'm passionate about running local models efficiently. I've recently been designing a custom hardware architecture - an NPU Array (v1) - specifically optimized for matrix multiplication and high TOPS/Watt performance for local AI inference. I've just open-sourced the entire repository here:

Disclaimer: This is an early-stage, experimental hardware design. It’s not a finished chip you can plug into a PCIe slot tomorrow. I am currently working on resolving routing congestion to hit my target clock frequencies.