BFCL Restraint Analysis - From Single Model to Compositional Systems
Meta LLaMA Startup Program Collaboration
Function-calling models often struggle with restraint: knowing when NOT to call a function. In BFCL testing, Llama 3.1 8B returned explanations instead of empty arrays 55% of the time; a simple prompt addition reduced that failure rate to 41%. This pattern appears across many models and can be corrected with targeted interventions.
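To make the failure mode concrete, here is a minimal sketch of this kind of prompt-level intervention and the restraint check it targets. The suffix wording and the `is_correct_empty_response` helper are illustrative placeholders, not the exact prompt or scoring code used in the study.

```python
import json

# Hypothetical "zero_output"-style instruction appended to the system prompt;
# the study's exact wording is not reproduced here.
ZERO_OUTPUT_SUFFIX = (
    "If none of the provided functions are relevant to the request, "
    "return an empty array [] and nothing else."
)

def is_correct_empty_response(model_output: str) -> bool:
    """True if the model returned an empty call list instead of an explanation."""
    try:
        return json.loads(model_output.strip()) == []
    except json.JSONDecodeError:
        return False  # prose explanations or refusals fail the restraint check

print(is_correct_empty_response("[]"))                            # True
print(is_correct_empty_response("I don't see a matching tool."))  # False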
- Live Irrelevance Improvement: +13.83%
- Cohen's h Effect Size: 0.58 (large practical effect)
- More Correct Empty Responses: +26.5%
- 8B Model Evaluations: 89,900 (50 runs × 1,798 tests)
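For context on the effect-size numbers: Cohen's h measures the difference between two proportions via an arcsine transform. A small self-contained sketch follows; the proportions are placeholders for illustration, not the exact per-category rates behind the 0.58 figure.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Placeholder rates of correct empty responses with and without the intervention.
baseline_rate = 0.45
intervention_rate = 0.59

h = cohens_h(intervention_rate, baseline_rate)
print(f"Cohen's h = {h:.2f}")  # ~0.28 for these placeholder rates
```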
We extended the Llama 3.1 8B analysis to the complete model family. The results reveal a fundamental insight: interventions don't improve models universally but create specialized configurations. By composing these specialists—running the same model twice with different interventions—we achieve performance exceeding the largest frontier models.
Key Insight: Smaller models show the largest practical improvements. Scout and 8B achieve Cohen's h > 0.55 (large effects), while 405B shows diminishing returns. This enables compositional expert systems where specialized configurations handle specific detection tasks, achieving 43-90% energy reduction through cascade prevention.
📊 Total Multi-Model Evaluations: 449,500 (5 models × 5 interventions × 50 runs × 1,798 tests)
Comprehensive view of all 5 interventions tested across BFCL categories. Shows how systematic testing reveals optimal solutions, with zero_output dramatically outperforming other approaches.
View Full Comparison →
Focused analysis of irrelevance and live_irrelevance improvements with Cohen's d annotations. Highlights the large effect sizes achieved through targeted intervention.
View Impact Analysis →
Concrete demonstration of the behavioral shift: from 557.8 to 705.4 average correct empty responses per run. Makes the improvement tangible and easy to understand.
View Behavioral Shift →
Interactive comparison of all 5 models showing baseline vs. intervention performance. Reveals model-specific response patterns and optimal intervention strategies for each architecture.
View Model Comparison →
Complete effect size analysis using Cohen's h (the proper metric for proportions). Shows all interventions across all models, revealing which combinations produce meaningful improvements.
View Effect Sizes →

We used the fan methodology to test multiple interventions across diverse model architectures:
Zero Output and Anti-Verbosity: Explicitly permitting empty responses when no functions match. Effect sizes vary by model.
Smaller models (8B, Scout) show larger improvements. 405B shows resistance, suggesting optimization limits.
Same model + different interventions = specialized experts. Two-pass filtering achieves superior performance (sketched below).
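One way to picture the compositional idea: run the same model twice, first in a restraint-tuned configuration that filters irrelevant requests, then in a baseline configuration that issues the actual calls. The sketch below assumes a hypothetical `generate(system_prompt, query)` callable and is an illustration of the pattern, not the project's implementation.

```python
from typing import Callable

def two_pass_function_calling(
    generate: Callable[[str, str], str],  # (system_prompt, user_query) -> model output
    query: str,
    detect_prompt: str,  # pass 1: intervention tuned for irrelevance detection
    call_prompt: str,    # pass 2: baseline prompt tuned for accurate function calls
) -> str:
    """Compose one model into two specialists: detect irrelevance first, then call."""
    first = generate(detect_prompt, query)
    if first.strip() == "[]":
        # The detection pass says no function applies; stop here and skip the
        # second call entirely (the "cascade prevention" idea).
        return "[]"
    # Otherwise run the second pass configured for high-quality function calls.
    return generate(call_prompt, query)
```

Stopping after the first pass whenever nothing applies is where the cascade-prevention savings described above would come from.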