Ollama Batch Evaluation Guide (LLM-as-a-Judge)#
This page was generated from ollama-interactive-inference/ollama-sif-batch-eval-ibex.ipynb. You can view or download the notebook, or view it on nbviewer.
Objective#
This guide shows how to automatically evaluate the responses of multiple models served by Ollama. Instead of manually scoring outputs, an LLM acts as a judge, scoring each response against reference answers or quality criteria you define.
Initial Setup#
If you have not completed the initial Conda environment setup and JupyterLab access steps, please refer to the Ollama on Ibex Guide - Approach-2: Notebook Workflows (Jupyter Based).
Starting the Ollama Server#
Start the Ollama REST API server using the following bash script in a terminal.
The script contains:
A user-editable section, where you define the scratch directory for Ollama models.
A step that saves the allocated port to a temporary ollama_port.txt file, so that the Python notebook can read the port assigned to the Ollama server.
A cleanup section that stops the Singularity instance when the script is terminated.
User Modification Section#
This section of the script is reserved for user-specific setup: it sets the directory where Ollama models are pulled and stored locally.
In the script, you will find a clearly marked block:
# --------------------------------
# START OF USER MODIFICATION SECTION
# --------------------------------
Note: Do not modify other parts of the script unless you know what you are changing, as they are required for correct execution.
import os, subprocess
script_content = """
#!/bin/bash
# Pre-start cleanup: ensure no stale instances or files
pre_cleanup() {
    echo "Running pre-start cleanup..."
    # 1. Stop any running Singularity instance with the same name
    if singularity instance list | grep -q "$SINGULARITY_INSTANCE_NAME"; then
        echo "Stopping existing Singularity instance: $SINGULARITY_INSTANCE_NAME"
        singularity instance stop "$SINGULARITY_INSTANCE_NAME"
    fi
    # 2. Remove old temporary or state files
    if [ -n "$OLLAMA_PORT_TXT_FILE" ] && [ -f "$OLLAMA_PORT_TXT_FILE" ]; then
        echo "Removing old port file: $OLLAMA_PORT_TXT_FILE"
        rm -f "$OLLAMA_PORT_TXT_FILE"
    fi
    if [ -n "$LOG_FILE" ] && [ -f "$LOG_FILE" ]; then
        echo "Removing old log file: $LOG_FILE"
        rm -f "$LOG_FILE"
    fi
    echo "Cleanup complete — ready to start new instance."
}
# Cleanup to run when the script exits
cleanup() {
    echo "🧹 Cleaning up before exit..."
    # Remove the temporary port file
    rm -f "$OLLAMA_PORT_TXT_FILE"
    # Stop the Singularity instance
    singularity instance stop "$SINGULARITY_INSTANCE_NAME"
}
trap cleanup SIGINT  # Catch Ctrl+C (SIGINT) and run cleanup
# --------------------------------
# START OF USER MODIFICATION SECTION
# --------------------------------
# Target directory on /ibex/user/$USER/ollama_models_scratch to store your Ollama models
export OLLAMA_MODELS_SCRATCH=/ibex/user/$USER/ollama_models_scratch
# --------------------------------
# END OF USER MODIFICATION SECTION
# --------------------------------
mkdir -p $OLLAMA_MODELS_SCRATCH
SINGULARITY_INSTANCE_NAME='ollama'
SINGULARITY_SIF_FILE="${SINGULARITY_INSTANCE_NAME}.sif"
OLLAMA_PORT_TXT_FILE='ollama_port.txt'
LOG_FILE=$PWD/ollama_server.log
# 1. Load the Singularity module
module load singularity
# 2. Run the pre-start cleanup now that the variables it uses are defined
pre_cleanup
# 3. Pull the Ollama Docker image (skip if the SIF file already exists)
if [ ! -f "$SINGULARITY_SIF_FILE" ]; then
    singularity pull --name $SINGULARITY_SIF_FILE docker://ollama/ollama
fi
# 4. Pick a free port for OLLAMA_HOST (default is 127.0.0.1:11434)
export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
# 5. Save the assigned port; the Python notebook reads it to connect to the server
echo "$PORT" > $OLLAMA_PORT_TXT_FILE
echo "OLLAMA PORT: $PORT -- Stored in $OLLAMA_PORT_TXT_FILE"
# 6. Define the OLLAMA host inside the container
export SINGULARITYENV_OLLAMA_HOST=127.0.0.1:$PORT
# 7. Change the default directory where models are stored
export SINGULARITYENV_OLLAMA_MODELS=$OLLAMA_MODELS_SCRATCH
# 8. Create the Singularity instance
singularity instance start --nv -B "/ibex/user:/ibex/user" $SINGULARITY_SIF_FILE $SINGULARITY_INSTANCE_NAME
# 9. Run the Ollama REST API server in the background
nohup singularity exec instance://$SINGULARITY_INSTANCE_NAME bash -c "ollama serve" > $LOG_FILE 2>&1 &
echo "Ollama server started. Logs at: $LOG_FILE"
"""
# Write script file
script_path = "ollama-server-start.sh"
with open(script_path, "w") as f:
    f.write(script_content)
os.chmod(script_path, 0o755)
# Run script
subprocess.run(["bash", script_path])
Running pre-start cleanup...
Cleanup complete — ready to start new instance.
Loading module for Singularity
Singularity 3.9.7 modules now loaded
OLLAMA PORT: 34321 -- Stored in ollama_port.txt
ollama-server-start.sh: line 9: singularity: command not found
FATAL: Image file already exists: "ollama.sif" - will not overwrite
Ollama server started. Logs at: /ibex/user/solimaay/scripts/jupyter/631115-ollama-sif/ibex-nb/ollama_server.log
INFO: instance started successfully
CompletedProcess(args=['bash', 'ollama-server-start.sh'], returncode=0)
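Because ollama serve is launched in the background with nohup, the REST API may take a few seconds to start accepting connections. The optional sketch below (an addition, not part of the original script) polls the server until it responds, reading the port from the ollama_port.txt file written above.
import time
import requests

def wait_for_ollama(port_file="ollama_port.txt", timeout=60, interval=2):
    """Poll the Ollama server until it responds or the timeout expires."""
    with open(port_file) as f:
        port = f.read().strip()
    url = f"http://127.0.0.1:{port}"
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # The root endpoint replies with "Ollama is running" once the server is up
            if requests.get(url, timeout=3).status_code == 200:
                print(f"Ollama is up at {url}")
                return url
        except requests.ConnectionError:
            pass  # Server not ready yet, keep polling
        time.sleep(interval)
    raise TimeoutError(f"Ollama server not reachable within {timeout} seconds")

# Example usage (optional helper, not used elsewhere in this notebook):
# wait_for_ollama()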
Using the Ollama Python Package#
The Python cells below contain code for:
Initialization and testing the connection to the Ollama server.
Listing local models.
Pulling models.
Running batch evaluation with an LLM judge.
1. Initialization#
Define the base URL for the Ollama server.
Test the Ollama server connectivity.
import asyncio
from ollama import AsyncClient
from typing import List, Dict
MAX_CONCURRENT = 2 # limit to avoid GPU overload
# Configuration: read the port assigned to the Ollama server
with open("ollama_port.txt") as f:
    PORT = f.read().strip()
BASE_URL = f"http://127.0.0.1:{PORT}"
print(BASE_URL)
http://127.0.0.1:34321
# Testing the server connectivity
import requests
try:
    r = requests.get(BASE_URL)
    print("Ollama is running!", r.status_code)
except requests.ConnectionError as e:
    print("Ollama is NOT reachable:", e)
Ollama is running! 200
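As an additional check, you can query Ollama's /api/version REST endpoint to confirm which server build is answering. This cell is an optional addition and reuses BASE_URL from the initialization cell above.
import requests

# Optional: query the Ollama version endpoint
try:
    v = requests.get(f"{BASE_URL}/api/version", timeout=5)
    print("Ollama version:", v.json().get("version"))
except requests.RequestException as e:
    print("Version check failed:", e)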
2. Get a List of Local Models#
Get a list of locally available Ollama models.
Locally available models are stored under: /ibex/user/$USER/ollama_models_scratch
To change the location for pulled models, modify the OLLAMA_MODELS_SCRATCH variable in the ollama-server-start.sh script.
from ollama import Client
client = Client(
    host=BASE_URL,
)

def get_local_models():
    """
    Return a list of locally available Ollama models.

    Returns:
        list: A list of model names as strings
    """
    models = [model['model'] for model in client.list()['models']]
    return models
# Usage
get_local_models()
['phi3:3.8b', 'qwen3:0.6b', 'gemma3:270m']
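A quick way to use this helper is to check which of the models you plan to evaluate are still missing locally. The model list below is illustrative and assumes the client defined above.
# Check which required models are missing locally (illustrative list)
required_models = ['qwen3:0.6b', 'gemma3:270m', 'phi3:3.8b']
missing = [m for m in required_models if m not in get_local_models()]
print("Missing models:", missing or "none")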
3. Pull The Model#
To pull a specific model, use the pull method.
Please refer to the Ollama Library to check available models.
# Pull the required models
client.pull("qwen3:0.6b")
ProgressResponse(status='success', completed=None, total=None, digest=None)
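Pulling a large model can take a while. The pull method also accepts stream=True so you can print progress updates as the download proceeds, mirroring the async pull used in the next section. This is a minimal sketch; the field names follow the progress objects used below and may vary by client version.
# Stream pull progress instead of waiting silently
for progress in client.pull("qwen3:0.6b", stream=True):
    status = progress.get("status", "")
    completed, total = progress.get("completed"), progress.get("total")
    if completed and total:
        print(f"{status}: {completed}/{total} bytes")
    else:
        print(status)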
4. Running Batch Evaluation#
The cell below queries each candidate model concurrently (pulling any missing models automatically), then asks a judge model to score every response against the criteria you define.
async def ensure_model_exists(client: AsyncClient, model: str):
    """
    Ensure that a specified Ollama model is available locally.
    If the model is not installed, it will be pulled from the server.

    Args:
        client (AsyncClient): An instance of the AsyncClient connected to the Ollama server.
        model (str): Name of the model to check and pull if necessary.

    Raises:
        Exception: If pulling or checking the model fails.
    """
    try:
        # Check if the model is already available
        await client.show(model)
        print(f"Model {model} already available locally.")
    except Exception:
        # Pull the model if it does not exist
        print(f"Pulling model {model}...")
        async for progress in await client.pull(model, stream=True):
            status = progress.get("status", "")
            if "completed" in status.lower():
                print(f"Pulled {model} successfully.")
        print(f"Model {model} is now ready for use.")
async def query_model_async(client: AsyncClient, model: str, prompt: str) -> str:
    """
    Send a single prompt to a specified Ollama model asynchronously and return the full response.

    Args:
        client (AsyncClient): An instance of AsyncClient connected to the Ollama server.
        model (str): Name of the Ollama model to query.
        prompt (str): The user input to send to the model.

    Returns:
        str: The complete response text from the model.

    Raises:
        Exception: If the chat request fails.
    """
    messages = [{"role": "user", "content": prompt}]
    response = ""
    async for chunk in await client.chat(model=model, messages=messages, stream=True):
        if chunk.get("message") and "content" in chunk["message"]:
            response += chunk["message"]["content"]
    return response
async def run_batch(models: List[str], prompt: str) -> Dict[str, str]:
    """
    Run multiple model inferences concurrently while limiting active requests.

    This function uses asynchronous concurrency control to efficiently query
    multiple models in parallel, ensuring that no more than MAX_CONCURRENT
    requests are active at a time. Each model is checked for availability before
    being queried, and missing models are automatically pulled.

    Args:
        models (List[str]): A list of model names to query.
        prompt (str): The user input or question to be sent to each model.

    Returns:
        Dict[str, str]: A dictionary mapping model names to their response text.
    """
    client = AsyncClient(host=BASE_URL)
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def safe_query(model):
        async with semaphore:
            # Ensure model is available before querying
            await ensure_model_exists(client, model)
            print(f"Running {model}...")
            result = await query_model_async(client, model, prompt)
            print(f"Done: {model}")
            return model, result

    results = await asyncio.gather(*(safe_query(m) for m in models))
    return dict(results)
async def judge_model_responses(
    client: AsyncClient,
    judge_model: str,
    responses: Dict[str, str],
    criteria: str
) -> Dict[str, Dict[str, str]]:
    """
    Use an LLM to evaluate the outputs of other models according to given criteria.

    Args:
        client (AsyncClient): An instance of AsyncClient connected to the Ollama server.
        judge_model (str): The model used as the judge.
        responses (Dict[str, str]): A dictionary mapping model names to their outputs.
        criteria (str): Evaluation criteria to guide the judging process.

    Returns:
        Dict[str, Dict[str, str]]: A dictionary mapping each evaluated model to its judgment,
            stored under the "evaluation" key with the judge's reasoning and score.

    Raises:
        Exception: If model evaluation fails or the judge model cannot be ensured.
    """
    judged = {}
    # Ensure the judge model exists locally
    await ensure_model_exists(client, judge_model)

    for model, answer in responses.items():
        judge_prompt = f"""
You are an impartial judge. Evaluate the following answer according to {criteria}.

Answer:
{answer}

Give:
- Reasoning
- Score from 1 to 10
"""
        eval_resp = await query_model_async(client, judge_model, judge_prompt)
        judged[model] = {"evaluation": eval_resp}
    return judged
async def main():
    # Define the list of models to evaluate [model_1, model_2, ...]
    models = ['qwen3:0.6b', 'gemma3:270m']
    # Define the judge LLM model
    judge_model = "phi3:3.8b"
    # Define the prompt
    prompt = "What is 1+1 equal to?"

    client = AsyncClient(host=BASE_URL)

    # 1. Run models (with auto-pull if missing)
    responses = await run_batch(models, prompt)
    # 2. Judge responses
    evaluations = await judge_model_responses(client, judge_model, responses, "clarity, correctness, and conciseness")
    # 3. Display
    for m in models:
        print(f"\n--- {m} ---")
        print("Response:\n", responses[m])
        print("Evaluation:\n", evaluations[m]["evaluation"])

await main()
Model qwen3:0.6b already available locally.
Running qwen3:0.6b...
Model gemma3:270m already available locally.
Running gemma3:270m...
Done: gemma3:270m
Done: qwen3:0.6b
Model phi3:3.8b already available locally.
--- qwen3:0.6b ---
Response:
1 + 1 equals 2.
Evaluation:
Reasoning: The provided response is succinct, accurate in terms of basic arithmetic principles (one plus one indeed yields two), and unambiguous due to the straightforward language used. Therefore, it effectively conveys a simple mathematical truth with minimal words required.
Score: I would give this answer an 8 out of 10 for clarity, correctness, and conciseness. The response could potentially be enhanced by providing context or stating that "one plus one equals two" explicitly defines the operation performed (addition), yet it remains unambiguous without such additions. However, as a standalone statement about an arithmetic fact, its simplicity is commendable, warranting this high score with just slight room for improvement.
--- gemma3:270m ---
Response:
1 + 1 = 2
Evaluation:
Reasoning: The provided statement presents a basic arithmetic operation which is the addition of two integers, both equal to one. This expression directly follows mathematical conventions for representing simple summation and provides an immediate answer without any ambiguity or extraneous information. As such, it adheres well to principles requiring clarity, correctness, and conciseness in presenting basic calculations within a mathematics context.
Score: 10/10 - The statement is clear, completely accurate for the operation performed (simple addition), and succinctly conveys the necessary information without superfluous details or unnecessary complexity that would detract from its quality as an answer to this mathematical expression problem.
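The judge returns free-form text, so for larger batches you may want to extract the numeric score programmatically. The sketch below is an optional addition that assumes the judge follows the "Score ... 1 to 10" format requested in the prompt; adjust the regular expression if your judge phrases scores differently.
import re

def extract_score(evaluation_text: str):
    """Return the first score from 1 to 10 found in the judge's evaluation, or None."""
    match = re.search(r"Score[^0-9]*(10|[1-9])", evaluation_text, re.IGNORECASE)
    return int(match.group(1)) if match else None

# Example usage with the 'evaluations' dictionary built inside main():
# for model, result in evaluations.items():
#     print(model, "->", extract_score(result["evaluation"]))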
Stop the Ollama Server#
Make sure to stop the Ollama server by stopping the Singularity instance.
import subprocess
import os
def stop_singularity_instance(instance_name="ollama", log_file=None, port_file=None):
    """
    Gracefully stop a running Singularity instance by name,
    and optionally remove associated log or port files.
    """
    print(f"Checking for Singularity instance: {instance_name}")

    # 1. Check if the instance is running
    try:
        result = subprocess.run(
            'bash -lc "module load singularity 2>/dev/null || true; singularity instance list"',
            shell=True,
            capture_output=True,
            text=True
        )
        if instance_name not in result.stdout:
            print(f"No running instance named '{instance_name}' found.")
        else:
            print(f"Instance '{instance_name}' is running. Attempting to stop it...")
            stop_result = subprocess.run(
                f'bash -lc "module load singularity 2>/dev/null || true; singularity instance stop {instance_name}"',
                shell=True,
                capture_output=True,
                text=True
            )
            if stop_result.returncode == 0:
                print(f"Singularity instance '{instance_name}' stopped successfully.")
            else:
                print(f"Warning: Failed to stop instance '{instance_name}'.")
                print(stop_result.stderr)
    except FileNotFoundError:
        print("Singularity command not found. Ensure it's installed and in PATH.")
        return

    # 2. Optional cleanup of associated files
    if port_file and os.path.exists(port_file):
        os.remove(port_file)
        print(f"Removed port file: {port_file}")
    if log_file and os.path.exists(log_file):
        os.remove(log_file)
        print(f"Removed log file: {log_file}")

    print("Cleanup complete.")

stop_singularity_instance(
    instance_name="ollama",
    log_file=os.path.expandvars("$PWD/ollama_server.log"),
    port_file=os.path.expandvars("$PWD/ollama_port.txt")
)
Checking for Singularity instance: ollama
Instance 'ollama' is running. Attempting to stop it...
Singularity instance 'ollama' stopped successfully.
Cleanup complete.
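If you want to confirm that nothing is left running, you can list the remaining Singularity instances. This is an optional verification step, not part of the original notebook.
import subprocess

# Verify that no Singularity instances remain
check = subprocess.run(
    'bash -lc "module load singularity 2>/dev/null || true; singularity instance list"',
    shell=True, capture_output=True, text=True
)
print(check.stdout)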