Hypernym

Helping Function Calling Models of All Sizes

BFCL Restraint Analysis - From Single Model to Compositional Systems

Meta LLaMA Startup Program Collaboration

What We Found

Function calling models often struggle with restraint: knowing when NOT to call a function. In BFCL testing, Llama 3.1 8B returned explanations instead of the expected empty array 55% of the time. A simple prompt addition cut that failure rate to 41%. The pattern appears across many models and can be fixed with targeted interventions.
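
To make the intervention concrete, here is a minimal sketch of a restraint-style prompt addition and the scoring check, assuming an illustrative clause text and helper names; the exact production wording is not shown in this summary.

```python
# Minimal sketch: append a restraint clause for the intervention arm, and
# score an irrelevance test by whether the model returned an empty array.
# The clause wording and helper names are illustrative, not the study's exact text.
import json

RESTRAINT_CLAUSE = (
    "If none of the provided functions are relevant to the request, "
    "return an empty array [] and nothing else."
)

def build_system_prompt(base_prompt: str, with_intervention: bool) -> str:
    """Append the restraint clause only for the intervention arm."""
    return f"{base_prompt}\n{RESTRAINT_CLAUSE}" if with_intervention else base_prompt

def is_correct_empty_response(raw_output: str) -> bool:
    """An irrelevance test passes only when the model returns an empty array."""
    try:
        return json.loads(raw_output.strip()) == []
    except json.JSONDecodeError:
        # Prose explanations ("I can't help with that...") fail to parse
        # and count as restraint failures.
        return False
```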

+13.83% Live Irrelevance Improvement
0.58 Cohen's h Effect Size (large practical effect)
+26.5% More Correct Empty Responses
89,900 8B Model Evaluations (50 runs × 1,798 tests)

Interactive Visualizations

Multi-Model Analysis: Beyond Single Model Improvements

We extended the Llama 3.1 8B analysis to the complete model family. The results reveal a fundamental insight: interventions don't improve models universally; they create specialized configurations. By composing these specialists (running the same model twice with different interventions), we achieve performance exceeding that of the largest frontier models.

Llama 3.1 8B: h = 0.58 (large effect)
Llama 3.3 70B: h = 0.48 (medium-large effect)
Llama 3.1 405B: h = 0.32 (medium effect)
Scout 17B: h = 0.57 (large effect)
Maverick 17B: h = 0.43 (medium-large effect)

Key Insight: Smaller models show the largest practical improvements. Scout and 8B achieve Cohen's h > 0.55 (large effects), while 405B shows diminishing returns. This enables compositional expert systems where specialized configurations handle specific detection tasks, achieving 43-90% energy reduction through cascade prevention.

📊 Total Multi-Model Evaluations: 449,500 (5 models × 5 interventions × 50 runs × 1,798 tests)
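
Cohen's h measures the gap between two proportions on the arcsine-transformed scale, which is why it is used for these pass/fail rates. The snippet below shows how such an effect size is computed; the proportions in the example are placeholders, not figures from the study.

```python
# Cohen's h for two proportions: h = |2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))|.
# Example proportions are placeholders, not results from this analysis.
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)

# e.g. a hypothetical 60% vs 45% correct-empty-response rate
print(round(cohens_h(0.60, 0.45), 2))  # -> 0.3
```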

8B Model Deep Dive

📊 Fan Methodology Results

Comprehensive view of all 5 interventions tested across BFCL categories. Shows how systematic testing reveals optimal solutions, with zero_output dramatically outperforming other approaches.

View Full Comparison →

📈 Primary Improvements

Focused analysis of irrelevance and live_irrelevance improvements with Cohen's d annotations. Highlights the large effect sizes achieved through targeted intervention.

View Impact Analysis →

🔄 Behavioral Change

Concrete demonstration of behavioral shift: from 557.8 to 705.4 average correct empty responses per run. Makes the improvement tangible and easy to understand.

View Behavioral Shift →

Multi-Model Comparisons

🌟 Multi-Model Performance

Interactive comparison of all 5 models showing baseline vs intervention performance. Reveals model-specific response patterns and optimal intervention strategies for each architecture.

View Model Comparison →

📊 Cohen's h Effect Heatmap

Complete effect size analysis using Cohen's h (proper metric for proportions). Shows all interventions across all models, revealing which combinations produce meaningful improvements.

View Effect Sizes →

Testing Approach

We used the fan methodology to test multiple interventions across diverse model architectures:

What Worked

Zero Output and Anti-Verbosity: Explicitly permitting empty responses when no functions match. Effect sizes vary by model.
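
A rough sketch of a fan run follows, assuming illustrative clause texts and a hypothetical evaluate() helper; only the intervention names come from this write-up.

```python
# Sketch of a "fan" run: a baseline plus several prompt-level interventions,
# each scored independently across repeated runs for one model.
INTERVENTIONS = {
    "baseline": "",
    "zero_output": "If no function applies, return an empty array [] and nothing else.",
    "anti_verbosity": "Do not explain or apologize; respond only with function calls or [].",
    # ...the remaining intervention variants would be listed here
}

def run_fan(model_name: str, test_cases, evaluate, runs: int = 50):
    """Score every intervention arm for one model and return mean scores."""
    results = {}
    for name, clause in INTERVENTIONS.items():
        scores = [evaluate(model_name, clause, test_cases) for _ in range(runs)]
        results[name] = sum(scores) / len(scores)
    return results
```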

Model-Specific Patterns

Smaller models (8B, Scout) show the larger improvements, while 405B shows resistance to the interventions, suggesting it is already near its optimization limit.

Compositional Strategy

The same model with different interventions yields a set of specialized experts. Composing them in a two-pass filter achieves superior performance, as sketched below.
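
A minimal sketch of the two-pass idea, assuming a hypothetical call_model() helper and illustrative system prompts; the production prompts and routing logic are not shown in this summary.

```python
# Minimal sketch of the two-pass composition: the same model is called twice
# with different interventions. Helper names and prompts are illustrative.
def two_pass_call(call_model, user_request, functions):
    """Pass 1 filters irrelevant requests; pass 2 produces the actual call."""
    # Pass 1: restraint specialist decides whether any function applies.
    relevance = call_model(
        system="If none of the functions apply, return [] and nothing else.",
        request=user_request,
        functions=functions,
    )
    if relevance == []:
        # Cascade prevention: stop here instead of forcing a spurious call.
        return []

    # Pass 2: standard configuration selects and fills in the function call(s).
    return call_model(
        system="Select and fill in the best matching function call.",
        request=user_request,
        functions=functions,
    )
```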