Hypernym

Helping Function Calling Models of All Sizes

BFCL Restraint Analysis - From Single Model to Compositional Systems

Meta LLaMA Startup Program Collaboration

What We Found

Function calling models often struggle with restraint: knowing when NOT to call a function. In BFCL testing, Llama 3.1 8B returned explanations instead of the expected empty array 55% of the time. A simple prompt addition cut that failure rate to 41%. The pattern appears across many models and can be fixed with targeted interventions.
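
To make the intervention concrete, here is a minimal sketch of a restraint-style prompt addition and the scoring check, assuming an illustrative clause text and helper names; the exact production wording is not shown in this summary.

```python
# Minimal sketch: append a restraint clause for the intervention arm, and
# score an irrelevance test by whether the model returned an empty array.
# The clause wording and helper names are illustrative, not the study's exact text.
import json

RESTRAINT_CLAUSE = (
    "If none of the provided functions are relevant to the request, "
    "return an empty array [] and nothing else."
)

def build_system_prompt(base_prompt: str, with_intervention: bool) -> str:
    """Append the restraint clause only for the intervention arm."""
    return f"{base_prompt}\n{RESTRAINT_CLAUSE}" if with_intervention else base_prompt

def is_correct_empty_response(raw_output: str) -> bool:
    """An irrelevance test passes only when the model returns an empty array."""
    try:
        return json.loads(raw_output.strip()) == []
    except json.JSONDecodeError:
        # Prose explanations ("I can't help with that...") fail to parse
        # and count as restraint failures.
        return False
```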

+13.83% Live Irrelevance Improvement
0.58 Cohen's h Effect Size (large practical effect)
+26.5% More Correct Empty Responses
89,900 8B Model Evaluations (50 runs × 1,798 tests)

Interactive Visualizations

Multi-Model Analysis: Beyond Single Model Improvements

We extended the Llama 3.1 8B analysis to the complete model family. The results reveal a fundamental insight: interventions don't improve models universally; they create specialized configurations. By composing these specialists (running the same model twice with different interventions), we achieve performance exceeding that of the largest frontier models.

Llama 3.1 8B: h = 0.58 (large effect)
Llama 3.3 70B: h = 0.48 (medium-large effect)
Llama 3.1 405B: h = 0.32 (medium effect)
Scout 17B: h = 0.57 (large effect)
Maverick 17B: h = 0.43 (medium-large effect)

Key Insight: Smaller models show the largest practical improvements. Scout and 8B achieve Cohen's h > 0.55 (large effects), while 405B shows diminishing returns. This enables compositional expert systems where specialized configurations handle specific detection tasks, achieving 43-90% energy reduction through cascade prevention.

📊 Total Multi-Model Evaluations: 449,500 (5 models × 5 interventions × 50 runs × 1,798 tests)
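
Cohen's h measures the gap between two proportions on the arcsine-transformed scale, which is why it is used for these pass/fail rates. The snippet below shows how such an effect size is computed; the proportions in the example are placeholders, not figures from the study.

```python
# Cohen's h for two proportions: h = |2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))|.
# Example proportions are placeholders, not results from this analysis.
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)

# e.g. a hypothetical 60% vs 45% correct-empty-response rate
print(round(cohens_h(0.60, 0.45), 2))  # -> 0.3
```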

8B Model Deep Dive

📊 Fan Methodology Results

Comprehensive view of all 5 interventions tested across BFCL categories. Shows how systematic testing reveals optimal solutions, with zero_output dramatically outperforming other approaches.

View Full Comparison →

📈 Primary Improvements

Focused analysis of irrelevance and live_irrelevance improvements with Cohen's d annotations. Highlights the large effect sizes achieved through targeted intervention.

View Impact Analysis →

🔄 Behavioral Change

Concrete demonstration of behavioral shift: from 557.8 to 705.4 average correct empty responses per run. Makes the improvement tangible and easy to understand.

View Behavioral Shift →

Multi-Model Comparisons

🌟 Multi-Model Performance

Interactive comparison of all 5 models showing baseline vs intervention performance. Reveals model-specific response patterns and optimal intervention strategies for each architecture.

View Model Comparison →

📊 Cohen's h Effect Heatmap

Complete effect size analysis using Cohen's h (proper metric for proportions). Shows all interventions across all models, revealing which combinations produce meaningful improvements.

View Effect Sizes →

Testing Approach

We used the fan methodology to test multiple interventions across diverse model architectures:

What Worked

Zero Output and Anti-Verbosity: Explicitly permitting empty responses when no functions match. Effect sizes vary by model.
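
A rough sketch of a fan run follows, assuming illustrative clause texts and a hypothetical evaluate() helper; only the intervention names come from this write-up.

```python
# Sketch of a "fan" run: a baseline plus several prompt-level interventions,
# each scored independently across repeated runs for one model.
INTERVENTIONS = {
    "baseline": "",
    "zero_output": "If no function applies, return an empty array [] and nothing else.",
    "anti_verbosity": "Do not explain or apologize; respond only with function calls or [].",
    # ...the remaining intervention variants would be listed here
}

def run_fan(model_name: str, test_cases, evaluate, runs: int = 50):
    """Score every intervention arm for one model and return mean scores."""
    results = {}
    for name, clause in INTERVENTIONS.items():
        scores = [evaluate(model_name, clause, test_cases) for _ in range(runs)]
        results[name] = sum(scores) / len(scores)
    return results
```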

Model-Specific Patterns

Smaller models (8B, Scout) show the larger improvements, while 405B shows resistance to the interventions, suggesting it is already near its optimization limit.

Compositional Strategy

The same model with different interventions yields a set of specialized experts. Composing them in a two-pass filter achieves superior performance, as sketched below.
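
A minimal sketch of the two-pass idea, assuming a hypothetical call_model() helper and illustrative system prompts; the production prompts and routing logic are not shown in this summary.

```python
# Minimal sketch of the two-pass composition: the same model is called twice
# with different interventions. Helper names and prompts are illustrative.
def two_pass_call(call_model, user_request, functions):
    """Pass 1 filters irrelevant requests; pass 2 produces the actual call."""
    # Pass 1: restraint specialist decides whether any function applies.
    relevance = call_model(
        system="If none of the functions apply, return [] and nothing else.",
        request=user_request,
        functions=functions,
    )
    if relevance == []:
        # Cascade prevention: stop here instead of forcing a spurious call.
        return []

    # Pass 2: standard configuration selects and fills in the function call(s).
    return call_model(
        system="Select and fill in the best matching function call.",
        request=user_request,
        functions=functions,
    )
```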