
AI Agentçè¯ä¼°ä¸æµè¯ï¼å¦ä½éåæºè½ä½æ§è½AI Agentçè½åæ£å¨å¿«éæ¼è¿ï¼ä»ç®åçé®ç婿å°è½å¤èªä¸»è§åãè°ç¨å·¥å ·ã宿夿任å¡çæºè½ç³»ç»ãä½å¦ä½ç§å¦è¯ä¼°ä¸ä¸ªAgentçè½åæ°´å¹³ï¼å¦ä½ç¡®ä¿å®å¨å®é é¨ç½²ä¸è¡¨ç°å¯é ï¼æ¬æå°ç³»ç»ä»ç»AI Agentçè¯ä¼°ç»´åº¦ãèªå¨åæµè¯æ¡æ¶ä»¥å主æµåºåæµè¯æ¹æ³ï¼å¸®å©å¼åè æå»ºå®åçAgentè¯æµä½ç³»ãä¸ã为ä»ä¹Agentè¯ä¼°å¦æ¤å¤æä¸ä¼ ç»è½¯ä»¶æå䏿¨¡åä¸åï¼AI Agentå ·æä»¥ä¸ç¹ç¹ï¼ä½¿å¾è¯ä¼°å 满ææï¼å¤è½®äº¤äºæ§ï¼Agentéè¦å¨å¤è½®å¯¹è¯ä¸ä¿æç¶æä¸è´æ§å·¥å ·è°ç¨è½åï¼æ¶åå¤é¨APIãæ°æ®åºãä»£ç æ§è¡ç heterogeneous å·¥å ·èªä¸»è§åï¼æ§è¡è·¯å¾éç¡®å®æ§ï¼åä¸ä»»å¡å¯è½æå¤ç§æ£ç¡®è§£æ³é¿ç¨ä¾èµï¼ä»»å¡æ¥éª¤é´åå¨å æé¾ï¼æ©æé误å¯è½çº§èæ¾å¤§ç¯å¢äº¤äºï¼éè¦ä¸çå®ææ¨¡æç¯å¢å¨æäº¤äºå æ¤ï¼Agentè¯ä¼°ä¸è½ç®åå¥ç¨LLMçperplexityæBLEUææ ï¼éè¦è®¾è®¡æ´å ¨é¢çè¯ä¼°æ¡æ¶ãäºãæ ¸å¿è¯ä¼°ç»´åº¦2.1 ä»»å¡å®æçï¼Task Success Rateï¼ä»»å¡å®æçæ¯æç´è§çè¯ä¼°ææ ï¼è¡¡éAgentå¨ç»å®ä»»å¡ä¸çæåçãfrom dataclasses import dataclass from typing import List, Optional, Any dataclass class TaskResult: 任塿§è¡ç»æ task_id: str success: bool # æ¯å¦æå宿 completion_rate: float # å®æåº¦ï¼0.0-1.0ï¼ steps_taken: int # æ§è¡æ¥æ° max_steps: int # æå¤§å è®¸æ¥æ° time_elapsed: float # èæ¶ï¼ç§ï¼ final_answer: str # æç»è¾åº gold_answer: str # æ åçæ¡ class TaskSuccessEvaluator: ä»»å¡å®æçè¯ä¼°å¨ def __init__(self, tolerance: float 0.05): self.tolerance tolerance # æ°å¼æ¯è¾å®¹å·® def exact_match(self, predicted: str, expected: str) - bool: 精确å¹é return predicted.strip() expected.strip() def contains_match(self, predicted: str, expected: str) - bool: å å«å¹é ï¼é¢æµå 嫿£ç¡®çæ¡å³å¯ return expected.strip().lower() in predicted.strip().lower() def numeric_match(self, predicted: str, expected: str) - bool: æ°å¼å¹é ï¼æ¯æå®¹å·®æ¯è¾ try: p_val float(predicted.replace(,, )) e_val float(expected.replace(,, )) return abs(p_val - e_val) / max(abs(e_val), 1e-10) self.tolerance except ValueError: return False def evaluate(self, results: List[TaskResult]) - dict: è¯ä¼°ä¸æ¹ä»»å¡ç»æ total len(results) success_count sum(1 for r in results if r.success) avg_completion sum(r.completion_rate for r in results) / total avg_steps sum(r.steps_taken for r in results) / total # æçææ ï¼æ¯å¦å¨åçæ¥æ°å 宿 efficient_count sum( 1 for r in results if r.success and r.steps_taken r.max_steps * 0.8 ) return { success_rate: success_count / total, avg_completion_rate: avg_completion, avg_steps: avg_steps, efficiency_rate: efficient_count / total, total_tasks: total } # 使ç¨ç¤ºä¾ results [