
LLM Inference

Using a pre-trained Large Language Model (LLM) to generate output for new inputs is called LLM inference. It is becoming increasingly important in many research areas. However, running these large models requires significant hardware resources, in particular powerful GPUs and large amounts of memory, which makes it a demanding and resource-intensive task. Below, we describe how to run inference with some popular LLM libraries.

Getting Access to LLM Models

Hugging Face Hub

The Hugging Face Hub is a central platform for sharing and accessing machine learning models, datasets, and tools. It hosts a vast collection of models for tasks such as text generation, translation, and classification. To download certain models, such as mistralai/Mistral-7B-Instruct-v0.3, you must first apply for access by accepting the model's license agreement. This requires creating an account on the Hugging Face website and generating an access token. Once logged in, visit the model's page and click the Accept button to gain access. Other models, such as facebook/opt-125m, are freely available.

Note

Downloading models should be done on the CPU clusters, e.g. Barnard or Romeo, and not on the GPU clusters like Alpha Centauri or Capella, in order to save resources.

Note

Be aware that the Hugging Face command hf stores a lot of files in your home directory by default. We highly advise changing the HF_HOME environment variable to a directory in a workspace. Otherwise, you may exceed your home directory quota and be blocked from submitting jobs.

# Within an interactive session on a CPU cluster, change into your workspace
marie@compute$ export WS=/data/horse/ws/marie-number_crunch
marie@compute$ cd $WS

# Create a Python environment and install huggingface_hub package via Python's pip
marie@compute$ module load release/25.06  GCCcore/14.2.0 Python/3.13.1
marie@compute$ python -m venv ./my_venv
marie@compute$ source ./my_venv/bin/activate
marie@compute$ pip install huggingface_hub

# Enter your HF access token if you download access-restricted models.
# It is necessary for this example.
marie@compute$ export HF_HOME=$WS/hf_home
marie@compute$ hf auth login

# Download the model into $HF_HOME/hub/models--*
marie@compute$ hf download mistralai/Mistral-7B-Instruct-v0.3

# You can download it into a different directory with
marie@compute$ hf download mistralai/Mistral-7B-Instruct-v0.3 \
    --local-dir $WS/models/mistralai-Mistral-7B-Instruct-v0.3
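
If you prefer to script downloads instead of using the hf CLI, the huggingface_hub Python package also provides snapshot_download. The following is only a minimal sketch under the same assumptions as above (HF_HOME set to a workspace directory and, for access-restricted models, a prior hf auth login); the target path is the one from the example above.

# Minimal sketch: downloading models with the huggingface_hub Python API
# instead of the hf CLI. Assumes HF_HOME points to a workspace directory and,
# for access-restricted models, that you are already logged in (hf auth login).
from huggingface_hub import snapshot_download

# Download into the default cache, i.e. $HF_HOME/hub/models--*
snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3")

# ... or into an explicit directory inside your workspace
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="/data/horse/ws/marie-number_crunch/models/mistralai-Mistral-7B-Instruct-v0.3",
)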

Inference Processing with vLLM

[vLLM](https://github.com/vllm-project/vllm) is an efficient inference system designed to accelerate large language model serving by optimizing resource usage and reducing latency. Other alternatives are also available, for example Hugging Face Transformers and llama.cpp.

We can only provide a brief overview of how to run vLLM on our clusters, as it offers many configuration options that can be tuned depending on the specific task. Detailed instructions are available in vLLM’s Getting started guide.

Note

We highly recommend running vLLM on our GPU clusters such as Alpha Centauri or Capella. vLLM is optimized for GPU acceleration and does not perform efficiently on CPU-only clusters. However, vLLM does not work on MIG-enabled cluster configurations like the partition capella-interactive.

Note

vLLM searches for models in the directory specified by the HF_HOME environment variable. We highly advise changing the HF_HOME environment variable to a directory in a workspace.

If the model is not found in your HF_HOME directory, it will be downloaded automatically from Hugging Face. To learn how to access restricted models on Hugging Face, see the Hugging Face Hub section.

Offline Processing

vLLM's offline mode is designed for batch inference using scripts. Prompts are processed directly via the command line or Python scripts without launching a server. If you have a set of queries or documents that can be processed in the background, offline processing is the right choice. It uses the same optimized backend as vLLM's server mode, which ensures high performance without the overhead of HTTP requests. Moreover, jobs do not have to run live or interactively; they can be processed as standard batch jobs in the background.

You simply specify the model, prompt, and generation parameters, which makes this a lightweight and scriptable approach for local inference tasks. You mainly have to create an LLM instance and invoke its .generate() method. Detailed instructions can be found in vLLM's Getting Started Guide – Offline Processing section.

Basic example script

An example basic.py taken from the vLLM repository.

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
    # Create an LLM.
    llm = LLM(model="facebook/opt-125m")
    # Generate texts from the prompts.
    # The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()
marie@login.capella$ salloc --ntasks=1 --nodes=1 --cpus-per-task=13 --mem=100G --gres=gpu:1
marie@login.capella$ module load container/all vLLM/0.10.0
marie@login.capella$ export HF_HOME=/data/horse/ws/marie-number_crunch/hf_home
marie@login.capella$ srun run_vllm python3 basic.py
INFO 08-13 17:44:31 [__init__.py:235] Automatically detected platform cuda.
INFO 08-13 17:44:43 [config.py:1604] Using max model len 2048
INFO 08-13 17:44:44 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 08-13 17:44:45 [core.py:572] Waiting for init message from front-end.
INFO 08-13 17:44:45 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev1+gbcc0a3cbe) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
[....]
Adding requests: 100%|██████████| 4/4 [00:00<00:00, 939.53it/s]
Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 29.57it/s, est. speed input: 192.44 toks/s, output: 473.66 toks/s]

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " Dan, and I'm the owner of Nae Nae Embrace, which"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    " publicly heckling the comedian's show for his jokes on black people.\n\n"
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' probably still the worst country in the world. No one is going to comment on'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' right here\nIs this what you think our world will be like next year?'
------------------------------------------------------------
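
The model argument of LLM() also accepts a local directory instead of a Hub identifier. Below is a hedged variant of the script above, assuming Mistral-7B-Instruct-v0.3 was downloaded earlier with --local-dir into the workspace path shown in the Hugging Face Hub section.

from vllm import LLM, SamplingParams

# Assumption: the model was downloaded earlier with
#   hf download mistralai/Mistral-7B-Instruct-v0.3 \
#       --local-dir $WS/models/mistralai-Mistral-7B-Instruct-v0.3
model_path = "/data/horse/ws/marie-number_crunch/models/mistralai-Mistral-7B-Instruct-v0.3"

llm = LLM(model=model_path)
outputs = llm.generate(
    ["The capital of Saxony is"],
    SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32),
)
print(outputs[0].outputs[0].text)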

Online Server

vLLM's online mode provides an OpenAI-compatible HTTP API for real-time text generation. It is designed for low-latency, high-throughput applications such as chatbots, web services, and interactive tools. Online mode runs a server that accepts requests over a RESTful API using standard OpenAI endpoints. This allows seamless integration with existing tools that support OpenAI-style interfaces. You can configure model parameters, deployment settings, and other options via the command line.

Online inference is ideal for interactive use cases where prompt-response cycles must happen quickly. Detailed setup instructions are available in vLLM's Getting Started Guide – OpenAI-Compatible Server section.

Inference Infrastructure by ScaDS.AI

For TU Dresden members, ScaDS.AI provides a [chat interface](https://chat.llm.scads.ai/) and [API access](https://llm.scads.ai/) to many popular open-source models.

Only for Personal Access

Please make sure that these services are used only for your personal work and that access to them remains protected. Using them in other ways may conflict with our terms of use.

Warning

When running vLLM in online mode, you are using a Slurm allocation for the entire duration that your server instance is active. This means that compute resources, including GPUs, are blocked for the full allocation time, not just while your server is actively processing requests. GPU usage will be charged based on the total allocation time, regardless of how long the server spends performing actual computations.

marie@login.capella$ salloc --ntasks=1 --nodes=1 --cpus-per-task=13 --mem=100G --gres=gpu:1
marie@login.capella$ module load container/all vLLM/0.10.0
marie@login.capella$ export HF_HOME=/data/horse/ws/marie-number_crunch/hf_home
# Use a free port on the node (--port) and protect access to your instance
# with an arbitrary, secret token (--api-key)
marie@login.capella$ srun run_vllm vllm serve \
    --port 8080 \
    --api-key="<my_secret_token>" \
    mistralai/Mistral-7B-Instruct-v0.3

INFO 08-13 17:16:49 [__init__.py:235] Automatically detected platform cuda.
INFO 08-13 17:16:54 [api_server.py:1755] vLLM API server version 0.10.1.dev1+gbcc0a3cbe
INFO 08-13 17:16:54 [cli_args.py:261]  non-default args: {'model_tag': 'mistralai/Mistral-7B-Instruct-v0.3', 'port': 8080, 'api_key': '<my_secret_token>', 'model': 'mistralai/Mistral-7B-Instruct-v0.3'}
INFO 08-13 17:17:12 [config.py:1604] Using max model len 32768
INFO 08-13 17:17:14 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py:24: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ensure correct encoding and decoding.
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 08-13 17:17:27 [__init__.py:235] Automatically detected platform cuda.
INFO 08-13 17:17:30 [core.py:572] Waiting for init message from front-end.
INFO 08-13 17:17:30 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev1+gbcc0a3cbe) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
[...]
INFO 08-13 17:18:34 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:8080
INFO 08-13 17:18:34 [launcher.py:29] Available routes are:
INFO 08-13 17:18:34 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /health, Methods: GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /load, Methods: GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /ping, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /ping, Methods: GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /version, Methods: GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/responses, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /pooling, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /classify, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /score, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /rerank, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /invocations, Methods: POST
INFO 08-13 17:18:34 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [1006308]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

You can test server access via its OpenAI-compatible API from another node. In this example, the server runs on node c28 and port 8080 with the API key <my_secret_token>. You can determine the node name with squeue on a login node or with hostname on the compute node before starting the vLLM server.

The vLLM server instance is only reachable from within the same cluster. This includes access from the login nodes or from another compute node. If you need to access the instance from an external computer, you must set up port forwarding.

marie@login.capella$ squeue -u marie
             JOBID PARTITION                 NAME     USER ST      TIME TIME_LIMI  NODES NODELIST(REASON)
            <Job_ID>   capella          interactive marie  R      5:31   1:00:00      1 c28
marie@login.capella$ curl http://c28:8080/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <my_secret_token>" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

{"id":"cmpl-5b7f165a8e864d82a256276e556bc4dd","object":"text_completion","created":1755098628,"model":"mistralai/Mistral-7B-Instruct-v0.3","choices":[{"index":0,"text":" city that is known for its beautiful","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null},"kv_transfer_params":null}