One day I came into work and Harper (our CEO) asked me something along the lines of “What can we do with this NVIDIA Spark box?” I had no idea what it was, since it hadn’t been released yet. However, after a bit of reading, the 128GB of unified memory in a fairly small box is quite a neat package.

*It is very, very gold and shiny.*
As a data scientist, a lot of my time over the past decade has been spent training neural networks, and one of the major constraints has always been GPU RAM. That ranges from the old 1070/1080 cards up through a personal 4090 and industry cards like A10s and, at one point, an H100. While most common tasks are still very doable on an H100, having access to the 128GB in the NVIDIA box, and having it in house, is quite novel. To see how the Spark box did, I decided to do a few different tests:

1. Run a Llama 4 model via Ollama for text generation tasks, and benchmark this against my M4 Apple Silicon Mac with 128GB of RAM
2. Configure it to run the new DeepSeek OCR model
3. Train a LoRA adapter for a Llama 3.1 70B model

I figured this was a reasonable collection of tasks: a more basic standard pipeline like generation, getting a new model working (which involves more troubleshooting of compatibility issues), and finally a training run with a reasonably large model that has, for the most part, been out of my standard ability to train.
## “Hello World”: running Llama 4 on NVIDIA Spark

After getting my connection set up and some very basic familiarity with the system, I decided to try to pull down a Llama 4 model from Meta and do some basic text generation with it. Installing and running Ollama and Open WebUI was a pretty quick process since the pipeline is well documented and understood; it took maybe 60 minutes, most of which was just downloading the Llama 4 model multiple times due to errors on my part. I selected the Llama 4 Scout model, which has 109B total parameters but only a smaller 17B parameter subset active at any given time. When I ran the downloaded model it used the expected 67GB or so of memory. Interestingly, once the model was running, the NVIDIA dashboard showed all of the memory as being allocated; however, other diagnostics make me think it was closer to the expected 67GB.

*NVIDIA DGX Spark memory usage dashboard*
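For reference, those “other diagnostics” amount to asking Ollama itself what it has loaded. Below is a minimal sketch against Ollama’s local HTTP API, assuming the default port and that the model was pulled under the `llama4:scout` tag; the exact fields in the `/api/ps` response can vary by Ollama version.

```python
# Minimal sketch: ask the local Ollama server what is loaded and how much
# memory it is using. Assumes Ollama's default port (11434) and that the
# model was pulled as "llama4:scout" -- adjust the tag to whatever you pulled.
import requests

OLLAMA = "http://localhost:11434"

# Trigger a small generation so the model gets loaded into memory.
resp = requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": "llama4:scout", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=600,
)
print(resp.json()["response"])

# /api/ps lists currently loaded models; size fields are reported in bytes.
for m in requests.get(f"{OLLAMA}/api/ps").json().get("models", []):
    print(m["name"], f'{m.get("size", 0) / 1e9:.1f} GB total', f'{m.get("size_vram", 0) / 1e9:.1f} GB in VRAM')
```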
So once we were at a point where we could generate text, my next thought was that the generation speed felt “ok… but I guess I have no idea how fast this actually is.”

*Unsurprisingly, it wasn’t aware of the new DGX Spark, which makes perfect sense.*

### NVIDIA Spark vs M4 Apple Silicon 128GB

The DGX Spark box was released on Oct 15 and it is currently Oct 24, so I am guessing future software and driver releases will smooth out some of the metrics and increase token generation speed, but as of week one here are some baseline metrics.
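For context on where the tokens-per-second numbers below come from: Ollama’s non-streaming responses include token counts and timings in nanoseconds, so generation speed can be computed roughly as in this sketch (assuming the default local endpoint and whatever model tag you pulled).

```python
# Rough sketch: compute generation speed from Ollama's response metadata.
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
import requests

def tokens_per_second(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    data = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=3600,
    ).json()
    return data["eval_count"] / (data["eval_duration"] / 1e9)

# Point `host` at the Spark's and the Mac's Ollama instances to compare the
# same prompt on both machines.
print(tokens_per_second("llama4:scout", "Write a long essay about Pokemon Red, Blue, and Yellow."))
```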
Earlier this year our team purchased M4 MacBook Pros with 128GB of unified memory, which is quite convenient here. When I got my MacBook one of the first things I did was download some large models and run them to see how they did. (After downloading maybe 1TB of models I haven’t bothered to do so again for a while, since most of my recent work hasn’t been around testing those models.) I ran a series of 5 prompts through a Llama 4 model via Ollama and the initial takeaway was “wow, the M4 Mac does way better.” The early tests used shortish prompts and outputs, which showed pretty consistent 30+ tokens/s (tokens per second) generation speed versus the DGX Spark box’s ~16.8 tokens/s. So at a glance it would be easy to say the M4 Apple Silicon architecture runs away with this test. However, something interesting happened: as the context length increased, the token generation speed on the M4 got slower (as expected) but the NVIDIA box stayed basically constant at 16.8 tokens/s. This led me to add an additional test where I fed the entire Pokemon Red, Blue, and Yellow Wikipedia page (~6,000 words) in as context and told it to generate a long piece of text. Writing an essay about the games, the M4 degraded to 18.81 tokens/s (a 38% decrease) versus the DGX Spark at 16.0 tokens/s (a 5% decrease). So while the M4 is still slightly faster, the DGX Spark doesn’t show any of the speed degradation I would have expected.

### Takeaways
As of now (Oct 24), the NVIDIA DGX Spark with a Llama 4 model is ~79.6% the speed of an Apple Silicon M4 machine on generation tasks when both have 128GB of unified RAM. However, the DGX Spark doesn’t exhibit the standard degradation of generation speed as the context length increases. Since it is still early I expect more optimizations to come; if those speedups don’t introduce the same sort of degradation, the DGX Spark box could be very interesting. More likely, once the proper kernels and software are added to the Spark ecosystem, it will have faster generation speeds that do get slower with longer input/output sequences, which will probably bring it much closer to the M4 numbers once it is more mature.

Reading through some other blogs I saw a good comment about how the ARM64 architecture on the DGX Spark is mature thanks to other devices like the Jetson boards, but the Blackwell GB10 GPU is new and CUDA 13.0 has just been released, so it will take a bit of time to catch up to more fully fleshed out ecosystems. Once the Blackwell GB10 ecosystem matures I would expect to see faster token generation speeds, and then the normal performance degradation as we get into longer input and output contexts.

From a price perspective the NVIDIA Spark box is reasonably competitive. When we got our laptops they cost $4,849, while the Spark runs at $4,000. So depending on your use case, you could get access to 128GB of unified memory to run models on at a lower price, at an admittedly slower speed than other options for now. Once the ecosystem is fleshed out a bit more, I think the NVIDIA Spark could be a good option to test or host models where you don’t need the absolute fastest response times, or to host a number of smaller models leveraging the 128GB of memory.

## DeepSeek OCR

*I’m doing my first playthrough of Hollow Knight: Silksong.*
This section was inspired by Simon Willison’s blog here. The DeepSeek OCR model is a 3B parameter model, and my environment already had an appropriate CUDA 13 setup since I had done some tests previously. I used a docker setup largely based off of the one NVIDIA uses for their PyTorch tutorials here. The main piece is the docker container nvcr.io/nvidia/pytorch:25.09-py3, and from there we can configure it as we need via a dockerfile (in the appendix).

### Claude Code Issues

Initially Claude Code had some issues finding a version
of transformers and PyTorch that would work with this pipeline. It tried PyTorch 2.7 + CUDA 13 but couldn’t find a build compatible with ARM64 and CUDA, so it went to PyTorch 2.9 + CUDA 13. Then, similarly, there was an issue with transformers versioning: an import error with LlamaFlashAttention2, which was solved by moving to an older version of transformers than it tried initially. It pinned transformers==4.46.3, which solved the immediate issues around getting the model to run. After this I ran into an issue that, in retrospect, Simon also hit: the saved output files initially contained only whitespace. Claude raised a victory flag saying it had run the model, but when I told it to look into the files it was confused and assumed the model hadn’t detected anything. I pushed back that the Silksong wallpaper does indeed have text. In the end it just had to figure out the correct output structure; it was printing out the content instead of writing it to the file we wanted. From the wallpaper we got the following pieces of text and their detected coordinates. This wasn’t the most complicated test, but it gets all the sections, isolates them to the correct bounding box areas, and handles the slightly odd text styles.
```
<|ref|>NINTENDO<|/ref|><|det|>[[160, 144, 240, 164]]<|/det|>
<|ref|>SWITCH.<|/ref|><|det|>[[24, 164, 115, 195]]<|/det|>
<|ref|>SWITCH.<|/ref|><|det|>[[160, 164, 240, 195]]<|/det|>
<|ref|>HOLLOW KNIGHT<|/ref|><|det|>[[384, 710, 600, 754]]<|/det|>
<|ref|>SILKSONG<|/ref|><|det|>[[299, 757, 699, 949]]<|/det|>
```
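For reference, the inference call that produces this kind of grounded output is short. The sketch below roughly follows the pattern from the DeepSeek-OCR model card as I understand it; the prompt string, image path, and `infer` arguments are assumptions and may differ between releases (and, as above, the transformers/attention setup needed pinning on this stack).

```python
# Rough sketch of running DeepSeek-OCR inside the container; the model card's
# `infer` helper and its exact arguments are assumed here and may vary by release.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# <|grounding|> asks the model to return bounding boxes along with the text.
prompt = "<image>\n<|grounding|>OCR this image."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="silksong_wallpaper.png",  # hypothetical input path
    output_path="output/",                # results are also written here
    save_results=True,
)
print(result)
```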
### DeepSeek OCR Takeaways

This process was very smooth and took maybe 40-60 minutes. Building off of docker containers that I knew worked on this system made the process easier and more repeatable. The only issues I hit were the standard sorts I would expect when using new models, where you have to figure out appropriate PyTorch and transformers versions, plus some oddness around how the model was writing out its results, and that issue was also quickly resolved. This is quite nice overall: a few days after a new model came out, I was able to get it running relatively painlessly on a new NVIDIA DGX Spark device.

## Llama 3.1 70B LoRA training

*NVIDIA Spark page*
For this I opted for a standard LoRA training run rather than a full fine-tuning of the model. I could likely have tested larger models, but this was mostly a proof of concept to see how smoothly this would mechanically work.

### Methods

This test applies a small dataset (2K samples) I built for a different experiment and uses it to benchmark how well this process works on the DGX Spark. For it, I took a pipeline I had built in a Jupyter notebook and migrated it over to a docker-container-based pipeline, referencing NVIDIA's playbook on [unsloth training in Pytorch](https://build.nvidia.com/spark/unsloth). The main useful pieces were seeing how they configured their docker containers and set everything up. Once that was done, all I had to do was swap in the new custom dataset I had built. While not super important, the dataset was one I generated to train a small query decomposition model: when a user submits a query, it may or may not be well aligned to the items we actually have within our RAG backend, so the idea was to train a model to break the main query down into a set of N search queries. A rough sketch of the training setup is shown below.
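This is a minimal sketch of the kind of Unsloth LoRA setup the playbook walks through, not my exact pipeline: the 4-bit checkpoint name, LoRA hyperparameters, and dataset path are illustrative assumptions, and depending on your trl version some `SFTTrainer` arguments may need to move into an `SFTConfig`.

```python
# Minimal sketch of an Unsloth LoRA run on a 4-bit Llama 3.1 70B checkpoint.
# Checkpoint name, hyperparameters, and dataset path are illustrative assumptions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit",  # assumed 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("json", data_files="query_decomposition.jsonl", split="train")  # ~2K samples

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumes each row is a pre-formatted prompt/response string
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=32,  # effective batch size of 64
        num_train_epochs=10,
        output_dir="outputs",
    ),
)
trainer.train()
```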
### Training and inference

After the standard issues around docker containers and figuring out which versions of PyTorch play well with what, the training process was straightforward. It took ~8.5 hours to run 340 batches of size 64 through the model (10 epochs). As expected, the model overfit to the problem, but I was mostly just here to see how smooth the process was. The model then does what we would expect on a new case. Is it great? No… but that is mostly an issue with the dataset, which I built for an earlier test. The model learns to take in a query and generate a JSON response with reasoning and decomposed-query fields. The section below shows what this looks like once we parse a result.

```prompt
Input: How do I learn machine learning while working full-time as a software engineer?
Reasoning: User need combines professional education, technology field learning, and time management for working engineers.
Decomposed queries: ['machine learning learning plans', 'full-time work balance learning', 'software engineer education', 'time management for learning']
```
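The parsing step itself is trivial; a tiny sketch is below, with the caveat that the exact key names (`reasoning`, `decomposed_queries`) are assumptions based on the description above rather than the verbatim schema.

```python
# Tiny sketch of parsing the adapter's raw JSON output into the fields shown above.
# The key names here are assumed, not the exact schema from the training data.
import json

raw_output = '{"reasoning": "User need combines professional education ...", "decomposed_queries": ["machine learning learning plans", "full-time work balance learning"]}'

parsed = json.loads(raw_output)
print("Reasoning:", parsed["reasoning"])
print("Decomposed queries:", parsed["decomposed_queries"])
```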
The model was around 37GB at 4-bit quantization, and the model + training together took around 74GB of the 128GB total. This means we could likely use a larger model. I am unsure if we can fit a 200B parameter model the way their docs claim, but I have not really been able to do training runs with models at the 70B parameter size easily before, so this was a nice change of pace.
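As a rough sanity check on that 200B claim, here is my own back-of-the-envelope arithmetic (not anything from the docs): 4-bit weights take about half a byte per parameter.

```python
# Back-of-the-envelope: memory for 4-bit quantized weights alone.
# Ignores KV cache, activations, LoRA/optimizer state, and framework overhead,
# which is why the observed training peak (74GB) sits well above the raw ~37GB.
def weight_gb(params_billion: float, bits: int = 4) -> float:
    return params_billion * 1e9 * (bits / 8) / 1e9

print(weight_gb(70))   # ~35 GB -- close to the ~37GB observed for Llama 3.1 70B
print(weight_gb(200))  # ~100 GB -- weights alone, before any training overhead
```

That lines up with the ~37GB observed here for 70B and suggests a 200B model would leave very little headroom for training on 128GB, though inference-only hosting might still fit.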
### Llama 70B LoRA training Takeaways

Like before, this machine is probably not the fastest way to do this, but the 128GB of unified RAM does make it much simpler to physically be able to do these sorts of tasks. The idea that we can train in-house adapters for a 70B parameter model without much difficulty is pretty great. While training, the peak memory used was only 74GB out of 128GB, which means we have a good bit of leeway to train larger models or do something more involved. Interestingly, reading through that [AIXplore](https://publish.obsidian.md/aixplore/Practical+Applications/dgx-lab-benchmarks-vs-reality-day-4#What's+Production-Ready) article, it looks like I got lucky and dodged a few bullets around issues with FP16 inference, since I kept the model at 4-bit quantization given its size. Another issue the authors cite is that they had to manually empty the cache every 50 training steps. My guess is that Unsloth handles this under the hood, since it prints warnings like `Unsloth: Will smartly offload gradients to save VRAM!`

## Conclusions

Getting early access to the NVIDIA DGX Spark has
been a fun experience, playing with some interesting hardware early in its lifecycle. The price point seems pretty good for access to 128GB of unified memory, and once the ecosystem gets proper support, hopefully we get faster generation and training speeds, which would help round out its performance. For me it is nice to have the ability to train and host some of the larger models in house if we so choose; it helps smooth out some of the annoyances of LLMs and how to make use of them. So I will be looking at how we can best utilize the DGX Spark as part of our pipelines as we move forward.

### Appendix
### Docker File for DeepSeek OCR

```
# DeepSeek-OCR using NVIDIA PyTorch 25.09 container
# Following NVIDIA DGX Spark playbook approach
FROM nvcr.io/nvidia/pytorch:25.09-py3

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

LABEL maintainer="DeepSeek-OCR Pipeline"
LABEL description="NVIDIA PyTorch container with DeepSeek-OCR for vision-language OCR"

# PyTorch is already installed in this container with CUDA support
# Install DeepSeek-OCR specific dependencies
# NOTE: transformers 4.46.3 is required for DeepSeek-OCR compatibility
RUN pip install --no-cache-dir \
    'transformers==4.46.3' \
    'tokenizers==0.20.3' \
    accelerate \
    pillow \
    requests \
    huggingface_hub \
    einops \
    addict \
    easydict \
    matplotlib \
    timm

# Verify installations
RUN python3 -c "import torch; print(f'PyTorch: {torch.__version__}')" && \
    python3 -c "import transformers; print(f'Transformers: {transformers.__version__}')" && \
    python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" && \
    echo "✓ All packages installed successfully"

# Set working directory
WORKDIR /workspace

# Default command
CMD ["/bin/bash"]
```