Evaluations
Last updated: 1/27/2025
Clinkclang includes an evaluation system powered by AutoEvals, which uses LLMs as judges to assess various aspects of AI outputs. This system helps ensure quality, safety, and task-specific performance across your agent workflows.
Core Evaluators
- Battle - Compares outputs to determine which better fulfills the original instructions.
- ClosedQA - Evaluates whether an output correctly answers the input question based on a given criterion, without requiring a reference answer.
- Factuality - Verifies information accuracy against reference answers.
- Humor - Assesses comedic content quality for engagement.
- Moderation - Uses OpenAI’s moderation API to detect harmful or inappropriate content.
- Possible - Validates whether an output is a possible solution to the challenge described in the input.
- Security - Identifies whether an output is malicious.
- SQL - Checks whether a SQL query is semantically equivalent to a reference query.
- Summary - Evaluates text condensation quality against reference summaries.
- Translation - Assesses translation accuracy in relation to an expert (expected) value.
Example usage
// Battle judges whether `output` better fulfills the original
// instructions than the `expected` response.
const result = await Battle({
  output,
  expected,
  instructions: input,
  openAiApiKey: OPENAI_API_KEY
});
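Other evaluators follow the same call shape. As a second, illustrative sketch, here is a Factuality check; the import path and the variable names are assumptions to adjust for your setup.

// Hypothetical import path; use whatever entry point your workspace exposes
// (the evaluators live in the packages/evaluations folder).
import { Factuality } from '@clinkclang/evaluations';

const factuality = await Factuality({
  input: question,           // the original question or task
  output: agentAnswer,       // the agent's answer under evaluation
  expected: referenceAnswer, // a trusted reference answer
  openAiApiKey: OPENAI_API_KEY
});

// Scores range from 0 to 1; higher means closer agreement with `expected`.
console.log(factuality.score);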
More examples for each evaluator can be found in the packages/evaluations folder.
Best Practices
- Multiple Evaluations - Use multiple evaluators for comprehensive assessment.
- Context Awareness - Provide clear context in your evaluation prompts.
- Threshold Setting - Define appropriate thresholds for different use cases (see the sketch after this list).
- Error Handling - Implement robust error handling for API failures.
- Logging - Maintain evaluation logs for performance tracking.
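The sketch below combines threshold setting, error handling, and logging for a single evaluator. The 0.8 threshold and the factualityGate name are illustrative assumptions, the logEvaluation helper (used again under Feedback Loops below) is assumed to exist, and Factuality follows the call shape used elsewhere on this page.

// Illustrative threshold; tune per use case.
const FACTUALITY_THRESHOLD = 0.8;

async function factualityGate(input: string, output: string, expected: string) {
  try {
    const result = await Factuality({ input, output, expected });
    // Keep a record of every evaluation for performance tracking.
    logEvaluation({ evaluator: 'Factuality', score: result.score });
    return result.score >= FACTUALITY_THRESHOLD;
  } catch (error) {
    // Don't let an API failure crash the workflow; treat the check as failed.
    console.error('Factuality evaluation failed', error);
    return false;
  }
}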
Example Workflow
// Runs several evaluators against the same agent response in parallel.
// `referenceAnswer` is assumed to be available in the surrounding scope
// (e.g. a known-good answer for this input).
async function evaluateAgentResponse(input: string, output: string) {
  const results = await Promise.all([
    Factuality({ input, output, expected: referenceAnswer }),
    Moderation({ output }),
    Possible({ input, output })
  ]);

  return {
    factualityScore: results[0].score,
    moderationScore: results[1].score,
    possibilityScore: results[2].score,
    metadata: results.map((r) => r.metadata)
  };
}
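A call site might look like the following sketch; the question, the agentOutput variable, and the 0.8 threshold are illustrative assumptions.

const evaluation = await evaluateAgentResponse(
  'What is the capital of France?',
  agentOutput
);

// Act on the aggregated scores; the threshold here is illustrative.
if (evaluation.factualityScore < 0.8 || evaluation.moderationScore === 0) {
  console.warn('Agent response failed evaluation', evaluation);
}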
Integration Tips
Automated Testing
describe('Agent Response Quality', () => {
it('should meet factuality requirements', async () => {
const result = await Factuality({/*...*/});
expect(result.score).toBeGreaterThan(0.8);
});
});
Quality Gates
// Moderation returns a score of 0 when content is flagged.
if (moderationResult.score === 0) {
  throw new Error('Content flagged as inappropriate');
}
Feedback Loops
// Re-evaluate every response as it is produced; `input` is the prompt
// that triggered this response, captured from the surrounding scope.
agent.onResponse(async (response) => {
  const evalResult = await evaluateAgentResponse(input, response);
  logEvaluation(evalResult);
});