• Home
  • About
  • Methodology
  • Verification Standard

AI model calibration measures whether a model's stated confidence matches how often it is actually correct. A calibration audit therefore assesses not only accuracy but the trustworthiness of a model's probability outputs, giving organizations a clearer picture of how their AI systems will behave in real-world scenarios.

View Methodology

Independent verification and calibration for AI systems

We evaluate whether model confidence survives contact with reality using pre-registered, reproducible methodologies. 

Learn More

PUBLIC CASE STUDY SUMMARY

DistilBERT SST-2 Calibration Evaluation

Executive Summary


Avenridge Institute conducted a pre-registered calibration evaluation of the publicly available model distilbert-base-uncased-finetuned-sst-2-english, using the SST-2 validation dataset.

The objective was not only to assess raw classification accuracy, but to determine whether the model's probability outputs were statistically calibrated with respect to observed outcomes.
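Statistical calibration of this kind is commonly checked by binning predictions by confidence and comparing each bin's average confidence to its empirical accuracy, e.g. via Expected Calibration Error (ECE). A minimal sketch with simulated data — not Avenridge's protocol or the audited model's outputs:

```python
import numpy as np

# Simulated confidences and correctness flags, for illustration only.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=2000)       # top-class confidence per prediction
# Simulate mild overconfidence: true accuracy runs ~5 points below confidence.
correct = rng.random(2000) < (conf - 0.05)

# Expected Calibration Error: partition predictions by confidence, then take
# the bin-size-weighted gap between mean confidence and empirical accuracy.
edges = np.linspace(0.5, 1.0, 11)
ece = 0.0
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (conf >= lo) & (conf < hi)
    if in_bin.any():
        gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
        ece += in_bin.mean() * gap
```

A perfectly calibrated model yields an ECE near zero; here the simulated five-point confidence gap surfaces as an ECE on the same order.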


The evaluation protocol was locked before execution, following a documented pre-registration methodology.

Key Result

The model achieved:

Accuracy: 91.06%
Brier Skill Score: +0.6664
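The Brier Skill Score compares the model's Brier score against a constant base-rate ("climatology") forecast: BSS = 1 − BS/BS_ref. A minimal sketch with hypothetical numbers, not the audited predictions:

```python
import numpy as np

# Hypothetical labels and forecast probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
p_pred = np.array([0.95, 0.10, 0.80, 0.60, 0.30, 0.99, 0.20, 0.70])

# Brier score: mean squared error between forecast probability and outcome.
bs = np.mean((p_pred - y_true) ** 2)

# Reference forecast: always predict the base rate of positive labels.
base_rate = y_true.mean()
bs_ref = np.mean((base_rate - y_true) ** 2)

# Brier Skill Score: +1 is a perfect forecast, 0 merely matches the baseline,
# and negative values are worse than always predicting the base rate.
bss = 1.0 - bs / bs_ref
```

With these hypothetical numbers, BSS comes out near 0.77 — well above the zero of a base-rate forecast, as with the audited model's +0.6664.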


However, the model failed the pre-registered calibration criterion. The strongest observed pattern was bimodal overconfidence: predictions near 0% confidence were wrong substantially more often than the model's probabilities implied, and predictions near 100% confidence were likewise significantly overconfident.


This evaluation demonstrated that strong benchmark accuracy does not necessarily imply reliable probabilistic calibration, underscoring the need for dedicated reliability evaluation of deployed AI systems.

Get the Report

Contact Us

avenridgeinstitute.com

Hours


09:00 am – 05:00 pm

Request an Evaluation


Copyright © 2026 avenridgeinstitute.com - All Rights Reserved.


  • Home
  • About
  • Methodology
  • Disclaimer
  • Privacy Policy
  • Terms of Service
  • Verification Standard
  • Git Discipline
