================================================================================
  AUTOMATED EXPERIMENT ANALYSIS REPORT
================================================================================
  Model: gpt-4o
  Runs: 5 per condition per dataset x 5 conditions x 3 datasets = 75
  Errors: 0
  Datasets: ops, saas, sales

────────────────────────────────────────────────────────────────────────────────
  CONSENSUS SCORES (mean of Judge/10, OE_LLM, and GEval)
────────────────────────────────────────────────────────────────────────────────
  Condition                    Judge/10  OE_LLM  GEval  CONSENSUS Cost/run Latency
  ---------------------------- -------- ------- ------ ---------- -------- ------
  Raw Prompt                       8.43   0.829  0.914      0.862 $0.00700  14.4s
  Template (Executive)             8.48   0.846  0.894      0.863 $0.00700  12.7s <-- baseline
  Template (Comprehensive)         8.10   0.711  0.887      0.803 $0.01100  23.8s
  LangChain LCEL                   8.47   0.620  0.890      0.786 $0.02300  40.2s
  CrewAI Crew                      8.95   0.662  0.941      0.833 $0.08700  47.8s
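
  Note: the sketch below (added for illustration; an assumption about the scoring,
  not the harness's actual code) shows how the CONSENSUS column is plausibly
  derived from the three scores above:

    # Python sketch: consensus = mean of judge (rescaled to 0-1), OE_LLM, GEval.
    def consensus(judge_out_of_10: float, oe_llm: float, geval: float) -> float:
        return (judge_out_of_10 / 10.0 + oe_llm + geval) / 3.0

    print(round(consensus(8.48, 0.846, 0.894), 3))  # Template (Executive) -> 0.863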

────────────────────────────────────────────────────────────────────────────────
  RANKINGS
────────────────────────────────────────────────────────────────────────────────
  #1 Template (Executive)         consensus=0.863  cost=$0.00700
  #2 Raw Prompt                   consensus=0.862  cost=$0.00700
  #3 CrewAI Crew                  consensus=0.833  cost=$0.08700
  #4 Template (Comprehensive)     consensus=0.803  cost=$0.01100
  #5 LangChain LCEL               consensus=0.786  cost=$0.02300

────────────────────────────────────────────────────────────────────────────────
  HEAD-TO-HEAD: Template (Executive) vs Alternatives
────────────────────────────────────────────────────────────────────────────────

  vs Raw Prompt:
    Quality: COMPARABLE (B=0.863 vs 0.862)
    Cost:    equal ($0.00700 vs $0.00700 per run)
    >>> Template B matches Raw Prompt on quality at identical cost

  vs Template (Comprehensive):
    Quality: WINS by 0.060 (B=0.863 vs 0.803)
    Cost:    1.6x cheaper ($0.00700 vs $0.01100 per run)
    >>> Template B achieves better quality at 1.6x lower cost

  vs LangChain LCEL:
    Quality: WINS by 0.077 (B=0.863 vs 0.786)
    Cost:    3.3x cheaper ($0.00700 vs $0.02300 per run)
    >>> Template B achieves better quality at 3.3x lower cost

  vs CrewAI Crew:
    Quality: WINS by 0.030 (B=0.863 vs 0.833)
    Cost:    12.4x cheaper ($0.00700 vs $0.08700 per run)
    >>> Template B achieves better quality at 12.4x lower cost
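
  Note: a hypothetical sketch of the head-to-head logic above (the function name,
  tie margin, and structure are assumptions, not the experiment's real code):

    def head_to_head(b_quality, b_cost, alt_quality, alt_cost, tie_margin=0.01):
        """Compare Template B to an alternative on consensus quality and per-run cost."""
        delta = b_quality - alt_quality              # positive = B scores higher
        cost_ratio = alt_cost / b_cost               # >1 = alternative is more expensive
        verdict = ("COMPARABLE" if abs(delta) < tie_margin
                   else "WINS" if delta > 0 else "LOSES")
        return delta, cost_ratio, verdict

    # Example: Template (Executive) vs CrewAI Crew.
    print(head_to_head(0.863, 0.00700, 0.833, 0.08700))  # -> (~0.030, ~12.4, 'WINS')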

────────────────────────────────────────────────────────────────────────────────
  STATISTICAL SIGNIFICANCE HIGHLIGHTS
────────────────────────────────────────────────────────────────────────────────
   (diff = alternative mean - Template B mean on that metric; positive = the
    alternative scores higher. * p<0.05  ** p<0.01  *** p<0.001; d = Cohen's d)

   LLM Judge | B vs C_template_comprehensive  diff=-0.383 (C worse)   p=0.0218 *    d=-0.95
   LLM Judge | B vs E_crewai_crew             diff=+0.467 (E better)  p=0.0103 *    d=+1.03
      OE LLM | B vs C_template_comprehensive  diff=-0.135 (C worse)   p=0.0039 **   d=-1.35
      OE LLM | B vs D_langchain_lcel          diff=-0.226 (D worse)   p=0.0000 ***  d=-2.80
      OE LLM | B vs E_crewai_crew             diff=-0.184 (E worse)   p=0.0001 ***  d=-2.03
       GEval | B vs A_raw                     diff=+0.021 (A better)  p=0.0399 *    d=+0.79
       GEval | B vs E_crewai_crew             diff=+0.047 (E better)  p=0.0000 ***  d=+2.05

  Non-significant comparisons: 5 of 12
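
  Note: the test behind these rows is assumed to be an independent-samples
  (Welch's) t-test per evaluator plus Cohen's d for effect size; the sketch
  below follows the table's sign convention (diff = alternative - B) and is
  an illustration, not the harness's code:

    import numpy as np
    from scipy import stats

    def compare(b_scores, alt_scores):
        """Return (diff, p, d): mean difference (alt - B), Welch p-value, Cohen's d."""
        b, alt = np.asarray(b_scores, float), np.asarray(alt_scores, float)
        diff = alt.mean() - b.mean()
        _, p = stats.ttest_ind(alt, b, equal_var=False)            # Welch's t-test
        pooled_sd = np.sqrt((b.var(ddof=1) + alt.var(ddof=1)) / 2)
        return diff, p, diff / pooled_sd                           # d = diff / pooled SD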

────────────────────────────────────────────────────────────────────────────────
  EVALUATOR AGREEMENT CHECK
────────────────────────────────────────────────────────────────────────────────
  (rank assigned to each condition by each evaluator; #1 = best)

  Condition                     Judge  OE_LLM  GEval  OE_Heuristic
  ----------------------------  -----  ------  -----  ------------
  Raw Prompt                      #4     #2      #2        #1
  Template (Executive)            #2     #1      #3        #5
  Template (Comprehensive)        #5     #3      #5        #4
  LangChain LCEL                  #3     #5      #4        #3
  CrewAI Crew                     #1     #4      #1        #2

  Template B ranked #1 by 1 of the 3 LLM evaluators (OE_LLM); the heuristic
  evaluator (OE_Heuristic) ranks it last (#5)
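
  Note: one way to quantify the disagreement visible above (added as an
  illustration; not part of the original report output) is Kendall's tau
  between two evaluators' rankings of the five conditions:

    from scipy.stats import kendalltau

    # Ranks read off the table above: [Raw, Tpl-Exec, Tpl-Comp, LCEL, CrewAI]
    judge_rank  = [4, 2, 5, 3, 1]
    oe_llm_rank = [2, 1, 3, 5, 4]

    tau, _ = kendalltau(judge_rank, oe_llm_rank)
    print(f"Judge vs OE_LLM rank agreement: tau={tau:.2f}")  # 1.0 = identical ordering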

────────────────────────────────────────────────────────────────────────────────
  VERBOSITY BIAS DETECTION
────────────────────────────────────────────────────────────────────────────────
  Raw Prompt                   tokens=   958  judge_rank=#4  oe_rank=#2  heuristic_rank=#1
  Template (Executive)         tokens=  1142  judge_rank=#2  oe_rank=#1  heuristic_rank=#5
  Template (Comprehensive)     tokens=  1832  judge_rank=#5  oe_rank=#3  heuristic_rank=#4
  LangChain LCEL               tokens=  3782  judge_rank=#3  oe_rank=#5  heuristic_rank=#3
  CrewAI Crew                  tokens= 14816  judge_rank=#1  oe_rank=#4  heuristic_rank=#2 ⚠ POSSIBLE VERBOSITY BIAS (high judge, low OE_LLM, high tokens)
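
  Note: a minimal sketch of the flagging rule suggested by the annotation above
  (high judge rank + low OE_LLM rank + long output); the thresholds are
  illustrative assumptions, not the detector's actual values:

    def possible_verbosity_bias(judge_rank, oe_rank, tokens, token_floor=5000):
        """Flag conditions the LLM judge favors despite weak OE_LLM scores and long outputs."""
        return judge_rank <= 2 and oe_rank >= 4 and tokens >= token_floor

    print(possible_verbosity_bias(1, 4, 14816))  # CrewAI Crew -> True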

────────────────────────────────────────────────────────────────────────────────
  MONTHLY COST @ 1000 analyses/month
────────────────────────────────────────────────────────────────────────────────
  Raw Prompt                   $    7.00/month
  Template (Executive)         $    7.00/month
  Template (Comprehensive)     $   11.00/month
  LangChain LCEL               $   23.00/month
  CrewAI Crew                  $   87.00/month

  Annual savings vs LCEL:  $192
  Annual savings vs Crew:  $960
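
  Note: the projections above are straight per-run scaling; the worked arithmetic
  below is illustrative, using the report's 1000 analyses/month assumption:

    VOLUME = 1000                                    # analyses per month
    per_run = {"Template (Executive)": 0.00700,
               "LangChain LCEL":       0.02300,
               "CrewAI Crew":          0.08700}
    monthly = {name: cost * VOLUME for name, cost in per_run.items()}
    print(round((monthly["LangChain LCEL"] - monthly["Template (Executive)"]) * 12, 2))  # -> 192.0
    print(round((monthly["CrewAI Crew"]    - monthly["Template (Executive)"]) * 12, 2))  # -> 960.0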

================================================================================
  FINAL VERDICT
================================================================================

  Template B (DataAnalyzer executive) ranks #1/5 by consensus score.
  Quality-per-dollar (consensus / per-run cost): 123 (0.863 / $0.00700; highest = best)

  Quality claims:
    - beats LangChain LCEL by 0.077
    - beats CrewAI Crew by 0.030
  Cost claims:
    - 3.3x cheaper than LangChain LCEL
    - 12.4x cheaper than CrewAI Crew

  ✓ THESIS SUPPORTED: Structured templates deliver top-tier quality at the lowest
    observed cost. Publishable claim: 'A single cognitive template matches chain-
    and agent-based frameworks (LangChain LCEL, CrewAI) in output quality while
    being 3-12x cheaper and 3-4x faster.'

================================================================================