
Evaluators

An evaluator is any callable from accumulated output to Decision:

from tap.evaluators.base import Decision, Evaluator

def my_evaluator(text: str) -> Decision:
    return Decision.HALT if "forbidden" in text else Decision.CONTINUE

The session calls the evaluator after each token chunk; the first HALT stops the session and force-signs a final commit.

Decision is a 2-value enum:

| Value | Meaning |
| --- | --- |
| `Decision.CONTINUE` | Keep streaming. |
| `Decision.HALT` | Stop the session and force-sign the final commit. |

Built-in evaluators

json_schema(schema)

Halts as soon as the streaming output can no longer be extended into a valid instance of the supplied JSON schema. Tolerates leading markdown fences (```json) and applies a 32-character warm-up window so partial preambles don't trip the syntactic gate.

If jsonschema is installed, full Draft-2020-12 validation runs once the buffer parses; otherwise the check degrades to syntax-only.

evaluators.json_schema({
    "type": "object",
    "required": ["title"],
    "properties": {"title": {"type": "string"}},
})
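For intuition, here is a rough, hypothetical sketch of the check's shape. The real prefix-extendability test (deciding whether a partial buffer can still become valid JSON) is more involved and is elided here, and `Decision` is redefined locally to keep the example self-contained:

```python
import json
from enum import Enum

try:
    import jsonschema  # optional: enables full schema validation
except ImportError:
    jsonschema = None

class Decision(Enum):  # local stand-in for tap.evaluators.base.Decision
    CONTINUE = "continue"
    HALT = "halt"

def json_schema_sketch(schema):
    def evaluate(text: str) -> Decision:
        buf = text.lstrip()
        if buf.startswith("```"):  # tolerate a leading markdown fence
            buf = buf.split("\n", 1)[1] if "\n" in buf else ""
        if len(buf) < 32:  # warm-up window: don't judge tiny prefixes
            return Decision.CONTINUE
        try:
            instance = json.loads(buf)
        except json.JSONDecodeError:
            # A real implementation would halt here if the buffer can no
            # longer be extended into valid JSON; that test is elided.
            return Decision.CONTINUE
        if jsonschema is not None:
            try:
                jsonschema.validate(instance, schema)
            except jsonschema.ValidationError:
                return Decision.HALT
        return Decision.CONTINUE
    evaluate.name = "json_schema"
    return evaluate
```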

length_cap(max_chars)

Halts when accumulated output reaches max_chars. Raises ValueError if max_chars <= 0. The name reports as length_cap(<n>) so session.halted_by shows the bound.
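The contract is small enough to sketch in a few lines. This is a hypothetical re-implementation, not the library's source; `Decision` is redefined locally so the example runs standalone:

```python
from enum import Enum

class Decision(Enum):  # local stand-in for tap.evaluators.base.Decision
    CONTINUE = "continue"
    HALT = "halt"

def length_cap_sketch(max_chars: int):
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    def evaluate(text: str) -> Decision:
        return Decision.HALT if len(text) >= max_chars else Decision.CONTINUE
    evaluate.name = f"length_cap({max_chars})"  # surfaces in session.halted_by
    return evaluate
```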

topic_drift(reference, *, threshold=0.05, min_chars=200, scorer=None)

Halts when scorer(reference, accumulated) falls below threshold. Default scorer is a cheap token-overlap heuristic; pass your own for embedding-based scoring. min_chars avoids halting on very short prefixes where the score is noisy.
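The default scorer isn't specified beyond "token-overlap heuristic"; one plausible shape, plus the halting rule around it, looks like this (hypothetical sketch; `Decision` redefined locally):

```python
from enum import Enum

class Decision(Enum):  # local stand-in for tap.evaluators.base.Decision
    CONTINUE = "continue"
    HALT = "halt"

def overlap_score(reference: str, accumulated: str) -> float:
    """Fraction of the reference's tokens that appear in the output."""
    ref = set(reference.lower().split())
    acc = set(accumulated.lower().split())
    if not ref or not acc:
        return 0.0
    return len(ref & acc) / len(ref)

def topic_drift_sketch(reference, *, threshold=0.05, min_chars=200, scorer=None):
    score = scorer or overlap_score
    def evaluate(text: str) -> Decision:
        if len(text) < min_chars:  # too short: the score would be noisy
            return Decision.CONTINUE
        return Decision.HALT if score(reference, text) < threshold else Decision.CONTINUE
    evaluate.name = "topic_drift"
    return evaluate
```

Passing a custom `scorer` (an embedding-based cosine similarity, say) slots into the same place without changing the halting logic.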

repetition_guard(*, window=200, threshold=0.4)

Halts when more than threshold of the last window characters are covered by a single 8-char n-gram — the most common token-padding failure mode (whitepaper §5.3.5). Window must be ≥ 32 characters; threshold must be in (0.0, 1.0].
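Coverage can be approximated by counting occurrences of the most frequent 8-gram in the window. A rough, hypothetical sketch (it overestimates coverage slightly, since occurrences may overlap; `Decision` redefined locally):

```python
from collections import Counter
from enum import Enum

class Decision(Enum):  # local stand-in for tap.evaluators.base.Decision
    CONTINUE = "continue"
    HALT = "halt"

def repetition_guard_sketch(*, window: int = 200, threshold: float = 0.4, n: int = 8):
    if window < 32:
        raise ValueError("window must be >= 32")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.0, 1.0]")
    def evaluate(text: str) -> Decision:
        tail = text[-window:]
        if len(tail) < n:
            return Decision.CONTINUE
        counts = Counter(tail[i:i + n] for i in range(len(tail) - n + 1))
        _, hits = counts.most_common(1)[0]
        covered = min(hits * n, len(tail))  # overlaps make this an upper bound
        return Decision.HALT if covered / len(tail) > threshold else Decision.CONTINUE
    evaluate.name = "repetition_guard"
    return evaluate
```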

content_policy(banned)

Halts on a substring or compiled regex match. Plain strings are matched case-insensitively; pre-compiled re.Patterns use their own flags.

import re

evaluators.content_policy([
    "API_KEY",
    re.compile(r"\b[A-Z0-9]{32,}\b"),  # candidate secret
])

Composing evaluators

evaluators.compose(*evaluators) ORs them together — the first to return HALT wins. The composed evaluator's name is "compose(<inner names>)", so session.halted_by still surfaces which inner check fired.

evaluator = evaluators.compose(
    evaluators.json_schema(EXPECTED_SCHEMA),
    evaluators.length_cap(2_000),
    evaluators.repetition_guard(),
)

Pass the composed evaluator to consumer.open_session(evaluator=...).
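Composition itself is just a short-circuit OR over the inner checks. A minimal, hypothetical sketch; the `", "` delimiter inside the composed name is an assumption, and `Decision` is redefined locally:

```python
from enum import Enum

class Decision(Enum):  # local stand-in for tap.evaluators.base.Decision
    CONTINUE = "continue"
    HALT = "halt"

def compose_sketch(*evaluators):
    def evaluate(text: str) -> Decision:
        for ev in evaluators:
            if ev(text) == Decision.HALT:  # first HALT wins
                return Decision.HALT
        return Decision.CONTINUE
    inner = ", ".join(getattr(ev, "name", "evaluator") for ev in evaluators)
    evaluate.name = f"compose({inner})"
    return evaluate

# Illustrative inner evaluator for the usage below.
def halt_on_x(text: str) -> Decision:
    return Decision.HALT if "x" in text else Decision.CONTINUE
halt_on_x.name = "halt_on_x"
```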

Writing custom evaluators

Any callable conforming to Evaluator = Callable[[str], Decision] works:

class TopicGuard:
    name = "topic_guard"  # surfaces in session.halted_by

    def __init__(self, allowed_topics: set[str]):
        self.allowed = allowed_topics

    def __call__(self, accumulated: str) -> Decision:
        # ... inspect text, return CONTINUE or HALT
        return Decision.CONTINUE

Set .name on the callable so session.halted_by reports a meaningful value instead of a generic "evaluator".
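For plain functions, the attribute can be set after definition. A small hypothetical example (`Decision` redefined locally; the `no_at_signs` check is illustrative only):

```python
from enum import Enum

class Decision(Enum):  # local stand-in for tap.evaluators.base.Decision
    CONTINUE = "continue"
    HALT = "halt"

def no_at_signs(accumulated: str) -> Decision:
    # Hypothetical check: halt if the output starts leaking email-like text.
    return Decision.HALT if "@" in accumulated else Decision.CONTINUE

no_at_signs.name = "no_at_signs"  # surfaces in session.halted_by
```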

Latency considerations

The evaluator runs after every token. Heavy per-token work — calls to a remote API, expensive parsing — will starve the streaming loop. Rule of thumb: evaluators should complete in under 5ms at the median.

For evaluators that must be expensive (a classifier model, for example), accumulate output and run the check every K tokens internally:

def heavy_evaluator():
    count = 0

    def _(text: str) -> Decision:
        nonlocal count
        count += 1
        if count % 50 != 0:
            return Decision.CONTINUE
        return run_classifier(text)  # expensive check runs every 50th call

    return _