Developing an open-source LLM from the ground up, from pretraining to RLHF (PPO/GRPO)

Hello, I have been working on building an LLM from the ground up. It is based on the DeepSeek architecture, heavily optimized to reduce its VRAM footprint (GUM + Muon). Below is the JSON schema I am currently using, which should be enough to show what is being pretrained. I have two 6000 Pro (600W) GPUs and am testing a 7B-parameter model with 64 experts. It is currently running on a single GPU at 100% throughput (the hardest part) (~80