[{"data":1,"prerenderedAt":623},["ShallowReactive",2],{"running-mmlu-5shot-on-nemotron-nano-omni-with-a-dgx-spark":3},{"id":4,"title":5,"body":6,"comments":601,"date":602,"description":603,"draft":601,"extension":604,"external":605,"image":606,"meta":607,"navigation":608,"path":609,"seo":610,"stem":611,"tags":612,"__hash__":622},"blog/2026/06/18/running-mmlu-5shot-on-nemotron-nano-omni-with-a-dgx-spark.md","Running MMLU 5-shot on Nemotron Nano Omni with a DGX Spark",{"type":7,"value":8,"toc":591},"minimark",[9,43,54,59,71,74,120,133,137,140,146,167,172,191,195,198,208,218,226,235,242,250,262,283,287,290,295,317,320,325,329,340,345,348,352,363,368,371,376,380,398,401,454,457,555,558,562,579,587],[10,11,12,13,17,18,25,26,32,33,36,37,42],"p",{},"This is the first proper LLM ",[14,15,16],"strong",{},"evaluation"," I've run end-to-end, and I learned a ton — both about how evals actually work under the hood and about the infrastructure quirks of running one against a model hosted on my ",[19,20,24],"a",{"href":21,"rel":22},"https://www.nvidia.com/en-us/products/workstations/dgx-spark/",[23],"nofollow","DGX Spark",". I scored ",[14,27,28],{},[29,30,31],"code",{},"nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4"," on ",[14,34,35],{},"MMLU 5-shot"," and submitted the result to ",[19,38,41],{"href":39,"rel":40},"https://www.localmaxxing.com/en/evals/mmlu-5shot",[23],"localmaxxing.com",". This post walks through the result with interactive charts, a few example questions, and the setup that got me there.",[10,44,45,48,49,53],{},[14,46,47],{},"Headline: 73.5%"," (95% CI 72.8–74.2, over all 14,042 questions). Random guessing is 25%, so for a sparse MoE model with only ~3B ",[50,51,52],"em",{},"active"," parameters running at 4-bit, that's a genuinely strong score.",[55,56,58],"h2",{"id":57},"what-mmlu-actually-is","What MMLU actually is",[10,60,61,66,67,70],{},[19,62,65],{"href":63,"rel":64},"https://github.com/hendrycks/test",[23],"MMLU"," (Massive Multitask Language Understanding) is ",[14,68,69],{},"14,042 multiple-choice questions across 57 subjects"," — everything from abstract algebra to professional law to marketing — grouped into four broad categories (STEM, Humanities, Social Sciences, Other). It's a broad knowledge-and-reasoning test.",[10,72,73],{},"A couple of things that surprised me as a first-timer:",[75,76,77,88],"ul",{},[78,79,80,83,84,87],"li",{},[14,81,82],{},"\"5-shot\""," means every question is preceded by ",[14,85,86],{},"5 fully worked examples"," from the same subject, so the model learns the answer format before it sees the real question. Those few-shot prefixes make the prompts long (more on that below).",[78,89,90,93,94,97,98,101,102,105,106,105,109,105,112,115,116,119],{},[14,91,92],{},"The model never \"writes\" an answer."," This is the part I didn't expect. The standard MMLU task is scored by ",[14,95,96],{},"log-likelihood",": we feed the prompt ending in ",[29,99,100],{},"Answer:"," and measure the probability the model assigns to each of ",[29,103,104],{},"A",", ",[29,107,108],{},"B",[29,110,111],{},"C",[29,113,114],{},"D",". Whichever letter gets the highest probability is the model's pick. Because we're reading probabilities directly — not generating text — ",[14,117,118],{},"temperature and sampling settings are irrelevant",", and the model's \"reasoning\" mode never even fires. The score is just the fraction of questions where the top-probability letter matches the answer key.",[10,121,122,123,128,129,132],{},"I ran this with ",[19,124,127],{"href":125,"rel":126},"https://github.com/EleutherAI/lm-evaluation-harness",[23],"EleutherAI's lm-evaluation-harness"," (",[29,130,131],{},"v0.4.9.1",", the version localmaxxing pins for this suite).",[55,134,136],{"id":135},"where-the-model-is-strong-and-weak","Where the model is strong and weak",[10,138,139],{},"Here's accuracy broken down by category. Social Sciences and \"Other\" are clearly its strong suits; STEM and Humanities trail.",[141,142,143],"client-only",{},[144,145],"mmlu-category-chart",{},[10,147,148,149,154,155,158,159,162,163,166],{},"The category averages hide a lot of variance, though. Here's every one of the 57 subjects, sorted weakest to strongest and colored by category (",[150,151,153],"span",{"style":152},"color:#4C72B0","■"," STEM · ",[150,156,153],{"style":157},"color:#DD8452"," Humanities · ",[150,160,153],{"style":161},"color:#55A868"," Social Sciences · ",[150,164,153],{"style":165},"color:#C44E52"," Other). Hover any bar for the exact accuracy and sample size:",[141,168,169],{},[170,171],"mmlu-subject-chart",{},[10,173,174,175,178,179,182,183,186,187,190],{},"The pattern is intuitive: it crushes broad, verbal, \"general knowledge\" subjects (",[14,176,177],{},"high-school government & politics 93%, psychology 90%, biology 89%",") and struggles with dense symbolic reasoning and niche trivia (",[14,180,181],{},"global facts 43%, abstract algebra 49%, high-school math 49%, formal logic 53%","). Humanities gets dragged down by ",[29,184,185],{},"formal_logic"," and ",[29,188,189],{},"professional_law",", which are really logic/reasoning tests in disguise.",[55,192,194],{"id":193},"a-few-questions-up-close","A few questions, up close",[10,196,197],{},"Numbers are abstract, so here are three actual questions to give a feel for what the model is being asked.",[10,199,200,203,204,207],{},[14,201,202],{},"It nailed this one (high confidence, correct)"," — ",[50,205,206],{},"miscellaneous",":",[209,210,211],"blockquote",{},[10,212,213,214,217],{},"What kind of angle is formed where two perpendicular lines meet?\nA. obtuse · B. acute · ",[14,215,216],{},"C. right ✓ (model picked C)"," · D. invisible",[10,219,220,203,223,207],{},[14,221,222],{},"A genuinely hard miss",[50,224,225],{},"abstract algebra",[209,227,228],{},[10,229,230,231,234],{},"Find the degree for the given field extension Q(√2, √3, √18) over Q.\nA. 0 · ",[14,232,233],{},"B. 4 ✓"," · C. 2 (model picked C) · D. 6",[10,236,237,238,241],{},"That one requires actually reasoning about field extensions — and note √18 = 3√2 is ",[50,239,240],{},"not"," independent of √2, a classic trap. The model fell for it.",[10,243,244,203,247,207],{},[14,245,246],{},"And one that taught me evals aren't ground truth",[50,248,249],{},"computer security",[209,251,252],{},[10,253,254,255,257,258,261],{},"Three of the following are classic security properties; which one is ",[14,256,240],{},"?\nA. Confidentiality · ",[14,259,260],{},"B. Availability ✓ (answer key)"," · C. Correctness (model picked C) · D. Integrity",[10,263,264,265,268,269,272,273,275,276,279,280,282],{},"The model picked ",[14,266,267],{},"C. Correctness"," — and it's ",[50,270,271],{},"right",". The classic security triad is ",[14,274,111],{},"onfidentiality, ",[14,277,278],{},"I","ntegrity, ",[14,281,104],{},"vailability (the \"CIA triad\"), so \"Correctness\" is the one that isn't classic. The answer key says B (Availability), which is simply wrong. MMLU has a well-documented amount of label noise like this, and it's a good reminder that a few points of any benchmark score are just dataset errors. The model got \"penalized\" here for being more correct than the test.",[55,284,286],{"id":285},"how-long-are-these-prompts-anyway","How long are these prompts, anyway?",[10,288,289],{},"Because of the 5-shot prefix, the prompts aren't short. Here's the token-length distribution (measured with the model's own tokenizer):",[141,291,292],{},[293,294],"mmlu-length-hist",{},[10,296,297,298,301,302,305,306,309,310,312,313,316],{},"Mean is ~702 tokens, but the long tail matters: the dense history and law subjects push the ",[14,299,300],{},"longest 5-shot prompt to 3,111 tokens",". This bit me before I started — lm-eval's default ",[29,303,304],{},"max_length"," is ",[14,307,308],{},"2048",", which would have silently truncated ~2.7% of prompts (concentrated in exactly those long-context subjects) and quietly cost me points. Bumping ",[29,311,304],{}," to ",[14,314,315],{},"4096"," captured 100% of prompts with zero truncation.",[10,318,319],{},"Does length actually hurt accuracy? A little, but it's mostly a confound — the longest prompts belong to the hardest subjects:",[141,321,322],{},[323,324],"mmlu-length-accuracy",{},[55,326,328],{"id":327},"how-confident-is-it-and-is-that-confidence-trustworthy","How confident is it — and is that confidence trustworthy?",[10,330,331,332,335,336,339],{},"Since scoring is probabilistic, I can measure ",[14,333,334],{},"confidence"," as the gap between the top choice's log-probability and the runner-up's. A well-behaved model should be ",[50,337,338],{},"more"," confident when it's right. It is — the \"correct\" mass sits clearly to the right of \"incorrect\":",[141,341,342],{},[343,344],"mmlu-confidence-chart",{},[10,346,347],{},"That separation is exactly what you want to see: when the model is unsure (small gap), it's much more likely to be wrong. It \"knows when it knows.\"",[55,349,351],{"id":350},"is-it-biased-toward-any-answer-letter","Is it biased toward any answer letter?",[10,353,354,355,358,359,362],{},"A classic failure mode is favoring a position (e.g., always leaning \"C\") regardless of content. Comparing how often each letter is the ",[50,356,357],{},"correct"," answer versus how often the model ",[50,360,361],{},"picks"," it, the bias here is mild — a slight lean toward A/B and away from D:",[141,364,365],{},[366,367],"mmlu-bias-chart",{},[10,369,370],{},"And when it's wrong, what does it confuse for what? The diagonal is correct picks; off-diagonal shows the (fairly uniform) confusions:",[141,372,373],{},[374,375],"mmlu-confusion-chart",{},[55,377,379],{"id":378},"the-setup-a-dgx-spark-k3s-vllm-and-an-off-box-eval-client","The setup: a DGX Spark, k3s, vLLM, and an off-box eval client",[10,381,382,383,385,386,389,390,393,394,397],{},"The infrastructure was half the adventure. The model runs on my ",[14,384,24],{}," (GB10 Grace Blackwell, 128 GB unified memory) as a ",[14,387,388],{},"vLLM"," pod inside a k3s cluster, served NVFP4-quantized with FP8 KV cache. The eval ",[50,391,392],{},"client"," (lm-eval) ran on my Mac, hitting vLLM over the LAN via the OpenAI-compatible ",[29,395,396],{},"/v1/completions"," endpoint.",[10,399,400],{},"A few hard-won lessons:",[75,402,403,424,438,444],{},[78,404,405,408,409,412,413,416,417,420,421,423],{},[14,406,407],{},"Use the completions endpoint, not chat."," Log-likelihood / MCQ scoring needs prompt log-probs (",[29,410,411],{},"echo"," + ",[29,414,415],{},"logprobs","), which chat-completion APIs don't expose. ",[29,418,419],{},"local-completions"," against ",[29,422,396],{}," is the way.",[78,425,426,433,434,437],{},[14,427,428,429,432],{},"Run the eval client ",[50,430,431],{},"off"," the Spark."," vLLM pins ~103 GB of the 128 GB unified memory. My first instinct — run lm-eval on the Spark itself — pushed it into a swap death-spiral that throttled vLLM to ",[50,435,436],{},"25 seconds per request",". Moving the client to my Mac (the GPU work stays on the Spark; only orchestration + tokenization moves) fixed it instantly.",[78,439,440,443],{},[14,441,442],{},"The tokenizer must match the server."," I pointed lm-eval at the model's own tokenizer so the context-length bookkeeping for log-prob scoring lined up byte-for-byte with vLLM. Mismatches there silently corrupt scores.",[78,445,446,449,450,453],{},[14,447,448],{},"Checkpoint long runs."," The Spark rebooted ~90 minutes in. ",[29,451,452],{},"--use_cache"," meant the completed requests were banked, and a resilient wrapper resumed from the cache and finished the remaining work without losing anything.",[10,455,456],{},"The actual command, for the curious:",[458,459,464],"pre",{"className":460,"code":461,"language":462,"meta":463,"style":463},"language-bash shiki shiki-themes github-light github-dark monokai","lm_eval --model local-completions \\\n  --model_args model=nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4,\\\nbase_url=http://\u003Cspark>:8000/v1/completions,tokenizer=\u003Clocal-tokenizer>,\\\nnum_concurrent=16,max_length=4096,tokenized_requests=False \\\n  --use_cache ~/mmlu_cache --cache_requests true \\\n  --tasks mmlu --num_fewshot 5 --output_path ~/mmlu-full --log_samples\n","bash","",[29,465,466,485,497,506,514,531],{"__ignoreMap":463},[150,467,470,474,478,482],{"class":468,"line":469},"line",1,[150,471,473],{"class":472},"srTi1","lm_eval",[150,475,477],{"class":476},"s7F3e"," --model",[150,479,481],{"class":480},"sstjo"," local-completions",[150,483,484],{"class":476}," \\\n",[150,486,488,491,494],{"class":468,"line":487},2,[150,489,490],{"class":476},"  --model_args",[150,492,493],{"class":480}," model=nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4,",[150,495,496],{"class":476},"\\\n",[150,498,500,504],{"class":468,"line":499},3,[150,501,503],{"class":502},"sMOD_","base_url=http://\u003Cspark>:8000/v1/completions,tokenizer=\u003Clocal-tokenizer>,",[150,505,496],{"class":476},[150,507,509,512],{"class":468,"line":508},4,[150,510,511],{"class":502},"num_concurrent=16,max_length=4096,tokenized_requests=False ",[150,513,496],{"class":476},[150,515,517,520,523,526,529],{"class":468,"line":516},5,[150,518,519],{"class":476},"  --use_cache",[150,521,522],{"class":480}," ~/mmlu_cache",[150,524,525],{"class":476}," --cache_requests",[150,527,528],{"class":476}," true",[150,530,484],{"class":476},[150,532,534,537,540,543,546,549,552],{"class":468,"line":533},6,[150,535,536],{"class":476},"  --tasks",[150,538,539],{"class":480}," mmlu",[150,541,542],{"class":476}," --num_fewshot",[150,544,545],{"class":476}," 5",[150,547,548],{"class":476}," --output_path",[150,550,551],{"class":480}," ~/mmlu-full",[150,553,554],{"class":476}," --log_samples\n",[10,556,557],{},"End to end, the full 14,042-question run took roughly two hours of GPU time on the GB10 at a sustained ~6,500 prompt-tokens/sec.",[55,559,561],{"id":560},"takeaways","Takeaways",[75,563,564,570,573,576],{},[78,565,566,569],{},[14,567,568],{},"73.5% on MMLU 5-shot"," is a strong result for a ~3B-active, 4-bit model — competitive with much larger dense models from a year ago.",[78,571,572],{},"Its profile is \"broad knowledge generalist\": excellent at verbal/social subjects, weaker at symbolic math and logic.",[78,574,575],{},"Evals are not ground truth — label noise is real, and a model can be marked wrong for being right.",[78,577,578],{},"The plumbing matters as much as the model: endpoint choice, where the client runs, tokenizer alignment, and checkpointing all materially affect whether you get a clean number.",[10,580,581,582,586],{},"The run is submitted to the ",[19,583,585],{"href":39,"rel":584},[23],"localmaxxing MMLU 5-shot leaderboard",". All 14,042 per-question records (prompt, choices, the four log-likelihoods, prediction, confidence) are retained, so every chart above is reproducible from the raw data.",[588,589,590],"style",{},"html pre.shiki code .srTi1, html code.shiki .srTi1{--shiki-default:#6F42C1;--shiki-dark:#B392F0;--shiki-sepia:#A6E22E}html pre.shiki code .s7F3e, html code.shiki .s7F3e{--shiki-default:#005CC5;--shiki-dark:#79B8FF;--shiki-sepia:#AE81FF}html pre.shiki code .sstjo, html code.shiki .sstjo{--shiki-default:#032F62;--shiki-dark:#9ECBFF;--shiki-sepia:#E6DB74}html pre.shiki code .sMOD_, html code.shiki .sMOD_{--shiki-default:#24292E;--shiki-dark:#E1E4E8;--shiki-sepia:#F8F8F2}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html .sepia .shiki span {color: var(--shiki-sepia);background: var(--shiki-sepia-bg);font-style: var(--shiki-sepia-font-style);font-weight: var(--shiki-sepia-font-weight);text-decoration: var(--shiki-sepia-text-decoration);}html.sepia .shiki span {color: var(--shiki-sepia);background: var(--shiki-sepia-bg);font-style: var(--shiki-sepia-font-style);font-weight: var(--shiki-sepia-font-weight);text-decoration: var(--shiki-sepia-text-decoration);}",{"title":463,"searchDepth":487,"depth":487,"links":592},[593,594,595,596,597,598,599,600],{"id":57,"depth":487,"text":58},{"id":135,"depth":487,"text":136},{"id":193,"depth":487,"text":194},{"id":285,"depth":487,"text":286},{"id":327,"depth":487,"text":328},{"id":350,"depth":487,"text":351},{"id":378,"depth":487,"text":379},{"id":560,"depth":487,"text":561},false,"2026-06-18","My first LLM eval end-to-end: scoring NVIDIA's Nemotron-3-Nano-Omni-30B (NVFP4) on MMLU 5-shot via vLLM on a DGX Spark, with interactive charts breaking down where it's strong, where it's weak, and how confident it is.","md",null,"/static/mmlu/cover.png",{},true,"/2026/06/18/running-mmlu-5shot-on-nemotron-nano-omni-with-a-dgx-spark",{"title":5,"description":603},"2026/06/18/running-mmlu-5shot-on-nemotron-nano-omni-with-a-dgx-spark",[613,614,615,616,617,618,619,620,621],"ai","llm","eval","mmlu","vllm","nemotron","dgx-spark","k8s","localmaxxing","PsXJuIY08tve_S6kaBuTrKgd4824cbS7pNHkWQDcwBc",1781796218120]