# TECHNICAL REPORT OF KIMI K2.5

# Kimi Team

# ABSTRACT

We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that the two modalities enhance each other, through techniques including joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains, including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5 \times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint<sup>1</sup> to facilitate future research and real-world applications of agentic intelligence.

Figure 1: Kimi K2.5 main results (percentile, %), comparing Kimi K2.5, GPT-5.2 (xhigh), Claude Opus 4.5, and Gemini 3 Pro on agentic benchmarks (Humanity's Last Exam (Full), BrowseComp, DeepSearchQA), coding (SWE-bench Verified, SWE-bench Multilingual), image (MMMU Pro, MathVision, OmniDocBench 1.5*), and video (VideoMMMU, LongVideoBench). *The OmniDocBench score is computed as (1 - normalized Levenshtein distance) × 100, where a higher score denotes superior accuracy.

# 1 Introduction

Large Language Models (LLMs) are rapidly evolving toward agentic intelligence. Recent advances, such as GPT-5.2 [41], Claude Opus 4.5 [6], Gemini 3 Pro [20], and Kimi K2-Thinking [1], demonstrate substantial progress in agentic capabilities, particularly in tool calling and reasoning. These models increasingly exhibit the ability to decompose complex problems into multi-step plans and to execute long sequences of interleaved reasoning and actions.
In this report, we introduce the training methods and evaluation results of Kimi K2.5. Concretely, we improve the training of K2.5 over previous models in the following two key aspects.

Joint Optimization of Text and Vision. A key insight from the practice of K2.5 is that joint optimization of text and vision enhances both modalities and avoids conflict between them. Specifically, we devise a set of techniques for this purpose. During pre-training, in contrast to conventional approaches that add visual tokens to a text backbone at a late stage [8, 21], we find that early vision fusion with a lower vision ratio tends to yield better results given a fixed total vision-text token budget. Therefore, K2.5 mixes text and vision tokens at a constant ratio throughout the entire training process. Architecturally, Kimi K2.5 employs MoonViT-3D, a native-resolution vision encoder incorporating the NaViT packing strategy [15], enabling variable-resolution image inputs. For video understanding, we introduce a lightweight 3D ViT compression mechanism: consecutive frames are grouped in fours, processed through the shared MoonViT encoder, and temporally averaged at the patch level. This design allows Kimi K2.5 to process videos up to $4 \times$ longer within the same context window while maintaining complete weight sharing between image and video encoders. During post-training, we introduce zero-vision SFT: text-only SFT alone activates visual reasoning and tool use. We find that adding human-designed visual trajectories at this stage hurts generalization. In contrast, text-only SFT performs better, likely because joint pre-training already establishes strong vision-text alignment, enabling capabilities to generalize naturally across modalities. We then apply joint RL on both text and vision tasks. Crucially, we find that visual RL enhances textual performance rather than degrading it, with improvements on MMLU-Pro and GPQA-Diamond.
This bidirectional enhancement, where text bootstraps vision and vision refines text, reflects the strong cross-modal alignment achieved by joint training.

Agent Swarm: Parallel Agent Orchestration. Most existing agentic models rely on sequential execution of tool calls. Even systems capable of hundreds of reasoning steps, such as Kimi K2-Thinking [1], suffer from linear scaling of inference time, leading to unacceptable latency and limiting task complexity. As agentic workloads grow in scope and heterogeneity (e.g., building a complex project that involves massive-scale research, design, and development), the sequential paradigm becomes increasingly inefficient. To overcome the latency and scalability limits of sequential agent execution, Kimi K2.5 introduces Agent Swarm, a dynamic framework for parallel agent orchestration. We propose a Parallel-Agent Reinforcement Learning (PARL) paradigm that departs from traditional agentic RL [2]. In addition to optimizing tool execution via verifiable rewards, the model is equipped with interfaces for sub-agent creation and task delegation. During training, sub-agents are frozen and their execution trajectories are excluded from the optimization objective; only the orchestrator is updated via reinforcement learning. This decoupling circumvents two challenges of end-to-end co-optimization: credit assignment ambiguity and training instability. Agent Swarm enables complex tasks to be decomposed into heterogeneous subproblems executed concurrently by domain-specialized agents, transforming task execution from linear sequential scaling to parallel processing. In wide-search scenarios, Agent Swarm reduces inference latency by up to $4.5 \times$ while improving item-level F1 from $72.8\%$ to $79.0\%$ compared to single-agent baselines.

Kimi K2.5 represents a unified architecture for general-purpose agentic intelligence, integrating vision and language, thinking and instant modes, chats and agents.
It achieves strong performance across a broad range of agentic and frontier benchmarks, including state-of-the-art results in visual-to-code generation (image/video-to-code) and real-world software engineering in our internal evaluations, while scaling both the diversity of specialized agents and the degree of parallelism. To accelerate community progress toward General Agentic Intelligence, we open-source our post-trained checkpoints of Kimi K2.5, enabling researchers and developers to explore, refine, and deploy scalable agentic intelligence.

# 2 Joint Optimization of Text and Vision

Kimi K2.5 is a native multimodal model built upon Kimi K2 through large-scale joint pre-training on approximately 15 trillion mixed visual and text tokens. Unlike vision-adapted models that compromise either linguistic or visual capabilities, our joint pre-training paradigm enhances both modalities simultaneously. This section describes the multimodal joint optimization methodology that extends Kimi K2 to Kimi K2.5.

# 2.1 Native Multimodal Pre-Training

A key design question for multimodal pre-training is: given a fixed vision-text token budget, what is the optimal vision-text joint-training strategy? Conventional wisdom [8, 21] suggests that introducing vision tokens predominantly in the later stages of LLM training at high ratios (e.g., $50\%$ or higher) accelerates multimodal capability acquisition, treating multimodal capability as a post-hoc add-on to linguistic competence.

Table 1: Performance comparison across different vision-text joint-training strategies. Early fusion with a lower vision ratio yields better results given a fixed total vision-text token budget.
<table><tr><td>Strategy</td><td>Vision Injection Timing</td><td>Vision-Text Ratio</td><td>Vision Knowledge</td><td>Vision Reasoning</td><td>OCR</td><td>Text Knowledge</td><td>Text Reasoning</td><td>Code</td></tr><tr><td>Early</td><td>0%</td><td>10%:90%</td><td>25.8</td><td>43.8</td><td>65.7</td><td>45.5</td><td>58.5</td><td>24.8</td></tr><tr><td>Mid</td><td>50%</td><td>20%:80%</td><td>25.0</td><td>40.7</td><td>64.1</td><td>43.9</td><td>58.6</td><td>24.0</td></tr><tr><td>Late</td><td>80%</td><td>50%:50%</td><td>24.2</td><td>39.0</td><td>61.5</td><td>43.1</td><td>57.8</td><td>24.0</td></tr></table>

However, our experiments (shown in Table 1 and Figure 9) reveal a different story. We conducted ablation studies varying the vision ratio and vision injection timing while keeping the total vision and text token budgets fixed. To strictly meet the targets for different ratios, we pre-trained the model with text-only tokens for a specifically calculated number of tokens before introducing vision data. Surprisingly, we found that the vision ratio has minimal impact on final multimodal performance; in fact, early fusion with a lower vision ratio yields better results given a fixed total vision-text token budget. This motivates our native multimodal pre-training strategy: rather than aggressive vision-heavy training concentrated at the end, we adopt a moderate vision ratio integrated early in the training process, allowing the model to naturally develop balanced multimodal representations while benefiting from extended co-optimization of both modalities.

# 2.2 Zero-Vision SFT

Pretrained VLMs do not naturally perform vision-based tool-calling, which poses a cold-start problem for multimodal RL. Conventional approaches address this issue through manually annotated or prompt-engineered chain-of-thought (CoT) data [8], but such methods are limited in diversity, often restricting visual reasoning to simple diagrams and primitive tool manipulations (crop, rotate, flip).
A key observation is that high-quality text SFT data are relatively abundant and diverse. We propose a novel approach, zero-vision SFT, that uses only text SFT data to activate visual agentic capabilities during post-training. In this approach, all image manipulations are proxied through programmatic operations in IPython, effectively serving as a generalization of traditional vision tool-use. This "zero-vision" activation enables diverse reasoning behaviors, including pixel-level operations such as object size estimation via binarization and counting, and generalizes to visually grounded tasks such as object localization, counting, and OCR. Figure 2 illustrates the RL training curves, where the starting points are obtained from zero-vision SFT. The results show that zero-vision SFT is sufficient for activating vision capabilities while ensuring generalization across modalities. This phenomenon is likely due to the joint pre-training of text and vision data as described in Section 2.1. Compared to zero-vision SFT, our preliminary experiments show that text-vision SFT yields much worse performance on visual agentic tasks, possibly because of the lack of high-quality vision data.

# 2.3 Joint Multimodal Reinforcement Learning (RL)

In this section, we describe the methodology implemented in K2.5 that enables effective multimodal RL, from outcome-based visual RL to emergent cross-modal transfer that enhances textual performance.

Outcome-Based Visual RL Following the zero-vision SFT, the model requires further refinement to reliably incorporate visual inputs into reasoning. Text-initiated activation alone exhibits notable failure modes: visual inputs are sometimes ignored, and images may not be attended to when necessary. We employ outcome-based RL on tasks that explicitly require visual comprehension for correct solutions.
We categorize these tasks into three domains:

- Visual grounding and counting: accurate localization and enumeration of objects within images;
- Chart and document understanding: interpretation of structured visual information and text extraction;
- Vision-critical STEM problems: mathematical and scientific questions filtered to require visual inputs.

Outcome-based RL on these tasks improves both basic visual capabilities and more complex agentic behaviors. Extracting these trajectories for rejection-sampling fine-tuning (RFT) enables a self-improving data pipeline, allowing subsequent joint RL stages to leverage richer multimodal reasoning traces.

Figure 2: Vision RL training curves on vision benchmarks starting from minimal zero-vision SFT. By scaling vision RL FLOPs, the performance continues to improve, demonstrating that zero-vision activation paired with long-running RL is sufficient for acquiring robust visual capabilities.

Table 2: Cross-Modal Transfer: Vision RL Improves Textual Knowledge

<table><tr><td>Benchmark</td><td>Before Vision-RL</td><td>After Vision-RL</td><td>Improvement</td></tr><tr><td>MMLU-Pro</td><td>84.7</td><td>86.4</td><td>+1.7</td></tr><tr><td>GPQA-Diamond</td><td>84.3</td><td>86.4</td><td>+2.1</td></tr><tr><td>LongBench v2</td><td>56.7</td><td>58.9</td><td>+2.2</td></tr></table>

Visual RL Improves Text Performance To investigate potential trade-offs between visual and textual performance, we evaluated text-only benchmarks before and after visual RL. Surprisingly, outcome-based visual RL produced measurable improvements in textual tasks, including MMLU-Pro $(84.7\% \rightarrow 86.4\%)$, GPQA-Diamond $(84.3\% \rightarrow 86.4\%)$, and LongBench v2 $(56.7\% \rightarrow 58.9\%)$ (Table 2). Analysis suggests that visual RL enhances calibration in areas requiring structured information extraction, reducing uncertainty on queries that resemble visually grounded reasoning (e.g., counting, OCR).
These findings indicate that visual RL can contribute to cross-modal generalization, improving textual reasoning without observable degradation of language capabilities.

Joint Multimodal RL Motivated by the finding that robust visual capabilities can emerge from zero-vision SFT paired with vision RL (which further enhances general text abilities), we adopt a joint multimodal RL paradigm during Kimi K2.5's post-training. Departing from conventional modality-specific expert divisions, we organize RL domains not by input modality but by abilities: knowledge, reasoning, coding, agentic, and so on. These domain experts jointly learn from both pure-text and multimodal queries, while the Generative Reward Model (GRM) similarly optimizes across heterogeneous traces without modality barriers. This paradigm ensures that capability improvements acquired through either textual or visual inputs inherently generalize to enhance related abilities across the alternate modality, thereby maximizing cross-modal capability transfer.

# 3 Agent Swarm

The primary challenge of existing agent-based systems lies in their reliance on sequential execution of reasoning and tool-calling steps. While this structure may be effective for simpler, short-horizon tasks, it becomes inadequate as the complexity of the task increases and the accumulated context grows. As tasks evolve to contain broad information gathering and intricate, multi-branch reasoning, sequential systems often encounter significant bottlenecks [5, 6, 7]. The limited capacity of a single agent working through each step one by one can lead to the exhaustion of practical reasoning depth and tool-call budgets, ultimately hindering the system's ability to handle more complex scenarios.

Figure 3: An agent swarm has a trainable orchestrator that dynamically creates specialized frozen subagents and decomposes complex tasks into parallelizable subtasks for efficient distributed execution.
To address this, we introduce Agent Swarm and Parallel Agent Reinforcement Learning (PARL). Instead of executing a task as a single reasoning chain or relying on pre-specified parallelization heuristics, K2.5 initiates an Agent Swarm through dynamic task decomposition, subagent instantiation, and parallel subtask scheduling. Importantly, parallelism is not presumed to be inherently advantageous; decisions regarding whether, when, and how to parallelize are explicitly learned through environmental feedback and RL-driven exploration. As shown in Figure 4, the progression of performance demonstrates this adaptive capability, with the cumulative reward increasing smoothly as the orchestrator optimizes its parallelization strategy throughout training.

Architecture and Learning Setup The PARL framework adopts a decoupled architecture comprising a trainable orchestrator and frozen subagents instantiated from fixed intermediate policy checkpoints. This design deliberately avoids end-to-end co-optimization to circumvent two fundamental challenges: credit assignment ambiguity and training instability. In this multi-agent setting, outcome-based rewards are inherently sparse and noisy; a correct final answer does not guarantee flawless subagent execution, just as a failure does not imply universal subagent error. By freezing the subagents and treating their outputs as environmental observations rather than differentiable decision points, we disentangle high-level coordination logic from low-level execution proficiency, leading to more robust convergence. To improve efficiency, we first train the orchestrator using small subagent models before transitioning to larger ones. Our RL framework also supports dynamically adjusting the ratio of inference instances between subagents and the orchestrator, thereby maximizing resource usage across the cluster.
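The decoupled rollout described above can be sketched as a minimal loop. All names here (`orchestrator_episode`, the action dictionary schema) are illustrative assumptions, not the actual framework API; the point is that frozen subagent runs are reduced to plain outputs in the next observation and never enter the recorded trajectory.

```python
def orchestrator_episode(orchestrator_policy, run_subagent, task, max_stages=8):
    """Sketch of a decoupled PARL rollout: only the orchestrator's own
    (observation, action) pairs are recorded for training, while frozen
    subagents' trajectories are collapsed into environmental observations."""
    trajectory = []              # orchestrator-only transitions used for RL
    obs = task
    for _ in range(max_stages):
        action = orchestrator_policy(obs)
        trajectory.append((obs, action))
        if action["type"] == "finish":
            break
        if action["type"] == "delegate":
            # Frozen subagents run (conceptually in parallel); only their
            # final outputs come back as the next observation.
            outputs = [run_subagent(sub) for sub in action["subtasks"]]
            obs = {"task": task, "subagent_outputs": outputs}
        else:  # direct tool invocation
            obs = {"task": task, "tool_output": action.get("result")}
    return trajectory
```

Because the subagent call is just an opaque function here, gradients can only ever flow through the orchestrator's decisions, mirroring the credit-assignment decoupling described in the text.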
PARL Reward Training a reliable parallel orchestrator is challenging due to the delayed, sparse, and non-stationary feedback inherent in independent subagent execution. To address this, we define the PARL reward as:

$$ r_{\text{PARL}}(x, y) = \lambda_{1} \cdot \underbrace{r_{\text{parallel}}}_{\text{instantiation reward}} + \lambda_{2} \cdot \underbrace{r_{\text{finish}}}_{\text{sub-agent finish rate}} + \underbrace{r_{\text{perf}}(x, y)}_{\text{task-level outcome}}. $$

The performance reward $r_{\mathrm{perf}}$ evaluates the overall success and quality of the solution $y$ for a given task $x$. This is augmented by two auxiliary rewards, each addressing a distinct challenge in learning parallel orchestration. The reward $r_{\mathrm{parallel}}$ is introduced to mitigate serial collapse, a local optimum where the orchestrator defaults to single-agent execution. By incentivizing subagent instantiation, this term encourages the exploration of concurrent scheduling spaces. The $r_{\mathrm{finish}}$ reward focuses on the successful completion of assigned subtasks. It is used to prevent spurious parallelism, a reward-hacking behavior in which the orchestrator inflates parallelism metrics by spawning many subagents without meaningful task decomposition. By rewarding completed subtasks, $r_{\mathrm{finish}}$ enforces feasibility and guides the policy toward valid and effective decompositions. To ensure the final policy optimizes for the primary objective, the hyperparameters $\lambda_{1}$ and $\lambda_{2}$ are annealed to zero over the course of training.

Figure 4: In our parallel-agent reinforcement learning environment, the training accuracy increases smoothly as training progresses. At the same time, the level of parallelism during training also gradually increases.
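The reward composition can be sketched as follows. The report specifies only the roles of $r_{\mathrm{parallel}}$ and $r_{\mathrm{finish}}$, not their formulas, so the indicator/ratio forms and the linear annealing schedule below are assumptions for illustration.

```python
def parl_reward(r_perf, spawned_subagents, finished_subtasks, total_subtasks,
                lam1, lam2):
    """Sketch of r_PARL: task-level outcome plus two annealed shaping terms.
    The concrete forms of r_parallel and r_finish are assumed, not from
    the report."""
    r_parallel = 1.0 if spawned_subagents > 0 else 0.0   # fights serial collapse
    r_finish = (finished_subtasks / total_subtasks       # fights spurious parallelism
                if total_subtasks > 0 else 0.0)
    return lam1 * r_parallel + lam2 * r_finish + r_perf

def annealed(lam0, step, total_steps):
    """Assumed linear schedule driving lambda_1, lambda_2 to zero."""
    return lam0 * max(0.0, 1.0 - step / total_steps)

# Early in training the shaping terms contribute; at the end only r_perf remains.
print(parl_reward(1.0, 3, 4, 4, 0.25, 0.25))                  # -> 1.5
print(parl_reward(1.0, 3, 4, 4, annealed(0.25, 100, 100),
                  annealed(0.25, 100, 100)))                  # -> 1.0
```

Annealing both coefficients to zero means the shaping terms only guide early exploration; the converged policy is scored purely by the task outcome, as the text requires.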
Critical Steps as Resource Constraint To measure computational time cost in a parallel-agent setting, we define critical steps by analogy to the critical path in a computation graph. We model an episode as a sequence of execution stages indexed by $t = 1, \dots, T$. In each stage, the main agent executes an action, which corresponds to either direct tool invocation or the instantiation of a group of subagents running in parallel. Let $S_{\mathrm{main}}^{(t)}$ denote the number of steps taken by the main agent in stage $t$ (typically $S_{\mathrm{main}}^{(t)} = 1$), and $S_{\mathrm{sub},i}^{(t)}$ denote the number of steps taken by the $i$-th subagent in that parallel group. The duration of stage $t$ is governed by the longest-running subagent within that cohort. Consequently, the total critical steps for an episode are defined as

$$ \text{Critical Steps} = \sum_{t=1}^{T} \left( S_{\text{main}}^{(t)} + \max_{i} S_{\text{sub}, i}^{(t)} \right). $$

By constraining training and evaluation using critical steps rather than total steps, the framework explicitly incentivizes effective parallelization. Excessive subtask creation that does not reduce the maximum execution time of parallel groups yields little benefit under this metric, while well-balanced task decomposition that shortens the longest parallel branch directly reduces critical steps. As a result, the orchestrator is encouraged to allocate work across subagents in a way that minimizes end-to-end latency, rather than merely maximizing concurrency or total work performed.

Prompt Construction for Parallel-agent Capability Induction To incentivize the orchestrator to leverage the advantages of parallelization, we construct a suite of synthetic prompts designed to stress the limits of sequential agentic execution.
These prompts emphasize either wide search, requiring simultaneous exploration of many independent information sources, or deep search, requiring multiple reasoning branches with delayed aggregation. We additionally include tasks inspired by real-world workloads, such as long-context document analysis and large-scale file downloading. When executed sequentially, these tasks are difficult to complete within fixed reasoning-step and tool-call budgets. By construction, they encourage the orchestrator to allocate subtasks in parallel, enabling completion within fewer critical steps than would be feasible for a single sequential agent. Importantly, the prompts do not explicitly instruct the model to parallelize. Instead, they shape the task distribution such that parallel decomposition and scheduling strategies are naturally favored.

# 4 Method Overview

# 4.1 Foundation: Kimi K2 Base Model

The foundation of Kimi K2.5 is Kimi K2 [53], a trillion-parameter mixture-of-experts (MoE) transformer [59] model pre-trained on 15 trillion high-quality text tokens. Kimi K2 employs the token-efficient MuonClip optimizer [30, 34] with QK-Clip for training stability. The model comprises 1.04 trillion total parameters with 32 billion activated parameters, utilizing 384 experts with 8 activated per token (a sparsity of 48). For detailed descriptions of MuonClip, architecture design, and training infrastructure, we refer to the Kimi K2 technical report [53].

Table 3: Overview of training stages: data composition, token volumes, sequence lengths, and trainable components.

<table><tr><td>Stages</td><td>ViT Training</td><td>Joint Pre-training</td><td>Joint Long-context Mid-training</td></tr><tr><td>Data</td><td>Alt text, Synthetic captions, Grounding, OCR, Video</td><td>+ Text, Knowledge, Interleaving, Video, OS Screenshots</td><td>+ High-quality Text & Multimodal, Long Text, Long Video, Reasoning, Long-CoT</td></tr><tr><td>Sequence length</td><td>4096</td><td>4096</td><td>32768→262144</td></tr><tr><td>Tokens</td><td>1T</td><td>15T</td><td>500B→200B</td></tr><tr><td>Training</td><td>ViT</td><td>ViT & LLM</td><td>ViT & LLM</td></tr></table>

# 4.2 Model Architecture

The multimodal architecture of Kimi K2.5 consists of three components: a three-dimensional native-resolution vision encoder (MoonViT-3D), an MLP projector, and the Kimi K2 MoE language model, following the design principles established in Kimi-VL [54].

MoonViT-3D: Shared Embedding Space for Images and Videos In Kimi-VL, we employ MoonViT to natively process images at their original resolutions, eliminating the need for complex sub-image splitting and splicing operations. Initialized from SigLIP-SO-400M [78], MoonViT incorporates the patch packing strategy from NaViT [15], where single images are divided into patches, flattened, and sequentially concatenated into 1D sequences, thereby enabling efficient simultaneous training on images at varying resolutions. To maximize the transfer of image understanding capabilities to video, we introduce MoonViT-3D with a unified architecture, fully shared parameters, and a consistent embedding space. By generalizing the "patch n' pack" philosophy to the temporal dimension, up to four consecutive frames are treated as a spatiotemporal volume: 2D patches from these frames are jointly flattened and packed into a single 1D sequence, allowing the identical attention mechanism to operate seamlessly across both space and time. While the extra temporal attention improves understanding of high-speed motion and visual effects, the weight sharing maximizes knowledge generalization from static images to dynamic videos, achieving strong video understanding performance (see Table 4) without requiring specialized video modules or architectural bifurcation.
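The spatiotemporal "patch n' pack" idea can be illustrated with a toy sketch. Real MoonViT-3D packs variable-resolution inputs and embeds patches before attention; this simplification uses fixed-size raw pixel patches purely to show how 2D patches from several frames end up in one 1D token sequence.

```python
def pack_spacetime(frames, patch=2):
    """Toy 'patch n' pack' over a spatiotemporal volume: 2D patches from
    consecutive frames are flattened and concatenated into a single 1D
    token sequence, so one attention pass can span space and time."""
    tokens = []
    for frame in frames:                      # temporal dimension
        h, w = len(frame), len(frame[0])
        for i in range(0, h, patch):          # spatial patches
            for j in range(0, w, patch):
                tokens.append([frame[r][c]
                               for r in range(i, i + patch)
                               for c in range(j, j + patch)])
    return tokens                             # list of flattened patches

volume = [[[0] * 4 for _ in range(4)] for _ in range(4)]  # 4 frames of 4x4 pixels
print(len(pack_spacetime(volume)))  # -> 16 tokens (4 patches x 4 frames)
```

Because the output is one flat sequence regardless of how many frames contributed, the same packed-attention machinery serves single images (one frame) and short clips alike.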
Prior to the MLP projector, lightweight temporal pooling aggregates patches within each temporal chunk, yielding $4 \times$ temporal compression to significantly extend the feasible video length. The result is a unified pipeline in which knowledge and abilities obtained from image pre-training transfer holistically to videos through one shared parameter space and feature representation.

# 4.3 Pre-training Pipeline

As illustrated in Table 3, Kimi K2.5's pre-training builds upon the Kimi K2 language model checkpoint and processes approximately 15T tokens across three stages: first, standalone ViT training to establish a robust native-resolution visual encoder; second, joint pre-training to simultaneously enhance language and multimodal capabilities; and third, mid-training on high-quality data and long-context activation to refine capabilities and extend context windows.

ViT Training Stage The MoonViT-3D is trained on image-text and video-text pairs, where the text components consist of a variety of targets: image alt texts, synthetic captions of images and videos, grounding bounding boxes, and OCR texts. The training incorporates two objectives following CoCa [75]: a SigLIP [78] loss $L_{siglip}$ (a variant of contrastive loss) and a cross-entropy loss $L_{caption}$ for caption generation conditioned on input images. We adopt a two-stage alignment strategy. In the first stage, we optimize solely the captioning loss $L_{caption}$ to align MoonViT-3D with Moonlight-16B-A3B [34], consuming 1T tokens, during which the ViT weights are updated. A very short second stage follows, updating only the MLP projector to bridge the ViT with the 1T-parameter LLM for smoother joint pre-training.

Joint Training Stages The joint pre-training stage continues from a near-end Kimi K2 checkpoint over an additional 15T vision-text tokens at a 4K sequence length.
The data recipe extends Kimi K2's pre-training distribution by introducing unique tokens, adjusting data proportions with increased weight on coding-related content, and controlling maximum epochs per data source. The third stage performs long-context activation with integrated higher-quality mid-training data, sequentially extending context length via YaRN [44] interpolation. This yields significant generalization improvements in long-context text understanding and long video comprehension. # 4.4 Post-Training # 4.4.1 Supervised Fine-Tuning Following the SFT pipeline established by Kimi K2 [53], we developed K2.5 by synthesizing high-quality candidate responses from K2, K2 Thinking and a suite of proprietary in-house expert models. Our data generation strategy employs specialized pipelines tailored to specific domains, integrating human annotation with advanced prompt engineering and multi-stage verification. This methodology produced a large-scale instruction-tuning dataset featuring diverse prompts and intricate reasoning trajectories, ultimately training the model to prioritize interactive reasoning and precise tool-calling for complex, real-world applications. # 4.4.2 Reinforcement Learning Reinforcement learning constitutes a crucial phase of our post-training. To facilitate joint optimization across text and vision modalities, as well as to enable PARL for agent swarm, we develop a Unified Agentic Reinforcement Learning Environment (Appendix D) and optimize the RL algorithms. Both text-vision joint RL and PARL are built upon the algorithms described in this section. Policy Optimization For each problem $x$ sampled from a dataset $\mathcal{D}$ , $K$ responses $\{y_1, \ldots, y_K\}$ are generated using the previous policy $\pi_{\mathrm{old}}$ . 
We optimize the model $\pi_{\theta}$ with respect to the following objective:

$$ L_{\mathrm{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{N} \sum_{j=1}^{K} \sum_{i=1}^{|y_{j}|} \operatorname{Clip}\left(\frac{\pi_{\theta}\left(y_{j}^{i} \mid x, y_{j}^{0:i}\right)}{\pi_{\mathrm{old}}\left(y_{j}^{i} \mid x, y_{j}^{0:i}\right)}, \alpha, \beta\right) \left(r(x, y_{j}) - \bar{r}(x)\right) - \tau \left(\log \frac{\pi_{\theta}\left(y_{j}^{i} \mid x, y_{j}^{0:i}\right)}{\pi_{\mathrm{old}}\left(y_{j}^{i} \mid x, y_{j}^{0:i}\right)}\right)^{2} \right]. \tag{1} $$

Here $\alpha, \beta, \tau > 0$ are hyperparameters, $y_{j}^{0:i}$ is the prefix up to the $i$-th token of the $j$-th response, $N = \sum_{j=1}^{K} |y_{j}|$ is the total number of generated tokens in a batch, and $\bar{r}(x) = \frac{1}{K} \sum_{j=1}^{K} r(x, y_{j})$ is the mean reward of all generated responses. This loss function departs from the policy optimization algorithm used in K1.5 [31] by introducing a token-level clipping mechanism designed to mitigate the off-policy divergence amplified by discrepancies between training and inference frameworks. The mechanism functions as a simple gradient masking scheme: policy gradients are computed normally for tokens with log-ratios within the interval $[\alpha, \beta]$, while gradients for tokens falling outside this range are zeroed out. Notably, a key distinction from standard PPO clipping [50] is that our method relies strictly on the log-ratio to explicitly bound off-policy drift, regardless of the sign of the advantages. This approach aligns with recent strategies proposed to stabilize large-scale RL training [74, 79]. Empirically, we find this mechanism essential for maintaining training stability in complex domains requiring long-horizon, multi-step tool-use reasoning. We employ the MuonClip optimizer [30, 34] to minimize this objective.
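The gradient-masking scheme can be sketched per token as follows. Note an ambiguity in the source: Eq. (1) writes the band as arguments to a Clip on the ratio, while the prose speaks of log-ratios; this sketch checks the ratio, which is an assumption on our part.

```python
import math

def masked_advantages(logp_new, logp_old, alpha, beta, advantages):
    """Gradient-masking view of the token-level Clip in Eq. (1): a token
    keeps its advantage only if its policy ratio pi_theta/pi_old lies in
    [alpha, beta]; otherwise its contribution is zeroed out, regardless
    of the advantage's sign (unlike one-sided PPO clipping)."""
    out = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        out.append(adv if alpha <= ratio <= beta else 0.0)
    return out

# Ratios: 1.00 (in band), ~2.23 (above beta), ~0.14 (below alpha).
print(masked_advantages([-1.0, -0.2, -3.0], [-1.0, -1.0, -1.0],
                        0.5, 2.0, [0.5, 0.5, -0.5]))  # -> [0.5, 0.0, 0.0]
```

Masking both directions of drift, independent of the advantage sign, is what distinguishes this scheme from standard PPO clipping, which only suppresses updates that would push the ratio further past one bound.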
Reward Function We apply a rule-based outcome reward for tasks with verifiable solutions, such as reasoning and agentic tasks. To optimize resource consumption, we also incorporate a budget-control reward aimed at enhancing token efficiency. For general-purpose tasks, we employ Generative Reward Models (GRMs) that provide granular evaluations aligned with Kimi's internal value criteria. In addition, for visual tasks, we design task-specific reward functions to provide fine-grained supervision. For visual grounding and point localization tasks, we employ an F1-based reward with soft matching: grounding tasks derive soft matches from Intersection over Union (IoU) and point tasks derive soft matches from Gaussian-weighted distances under optimal matching. For polygon segmentation tasks, we rasterize the predicted polygon into a binary mask and compute the segmentation IoU against the ground-truth mask to assign the reward. For OCR tasks, we adopt normalized edit distance to quantify character-level alignment between predictions and ground-truth. For counting tasks, rewards are assigned based on the absolute difference between predictions and ground-truth. Furthermore, we synthesize complex visual puzzle problems and utilize an LLM verifier (Kimi K2) to provide feedback. Generative Reward Models Kimi K2 leverages a self-critique rubric reward for open-ended generation [53], and K2.5 extends this line of work by systematically deploying Generative Reward Models (GRMs) across a broad range of agentic behaviors and multimodal trajectories. Rather than limiting reward modeling to conversational outputs, we apply GRMs on top of verified reward signals in diverse environments, including chat assistants, coding agents, search agents, and artifact-generating agents. 
Notably, GRMs function not as binary adjudicators, but as fine-grained evaluators aligned with Kimi's values that are critical to user experience, such as helpfulness, response readiness, contextual relevance, appropriate level of detail, aesthetic quality of generated artifacts, and strict instruction following. This design allows the reward signal to capture nuanced preference gradients that are difficult to encode with purely rule-based or task-specific verifiers. To mitigate reward hacking and overfitting to a single preference signal, we employ multiple alternative GRM rubrics tailored to different task contexts.

Token-Efficient Reinforcement Learning Token efficiency is central to LLMs with test-time scaling. While test-time scaling inherently trades computation for reasoning quality, practical gains require algorithmic innovations that actively navigate this trade-off. Our previous findings indicate that imposing a problem-dependent budget effectively constrains inference-time compute, incentivizing the model to generate more concise chain-of-thought reasoning without unnecessary token expansion [31, 53]. However, we also observe a length-overfitting phenomenon: models trained under rigid budget constraints often fail to generalize to higher compute scales. Consequently, they cannot effectively leverage additional inference-time tokens to solve complex problems, instead defaulting to truncated reasoning patterns.
To this end, we propose Toggle, a training heuristic that alternates between inference-time scaling and budget-constrained optimization: for learning iteration $t$, the reward function is defined by

$$ \tilde{r}(x, y) = \begin{cases} r(x, y) \cdot \mathbb{I}\left\{\frac{1}{K}\sum_{i=1}^{K} r(x, y_i) < \lambda \ \text{or}\ |y| \leq \operatorname{budget}(x)\right\} & \text{if } \lfloor t/m \rfloor \bmod 2 = 0 \ (\text{Phase 0}) \\ r(x, y) & \text{if } \lfloor t/m \rfloor \bmod 2 = 1 \ (\text{Phase 1}), \end{cases} \tag{1} $$

where $\lambda$ and $m$ are hyper-parameters of the algorithm and $K$ is the number of rollouts per problem. Specifically, the algorithm alternates between two optimization phases every $m$ iterations:

- Phase 0 (budget-limited phase): The model is trained to solve the problem within a task-dependent token budget. To prevent a premature sacrifice of quality for efficiency, this constraint is conditionally applied: it is only enforced when the model's mean accuracy for a given problem exceeds the threshold $\lambda$.
- Phase 1 (standard scaling phase): The model generates responses up to the maximum token limit, encouraging the model to leverage computation for better inference-time scaling.

The problem-dependent budget is estimated from the $\rho$-th percentile of token lengths among the subset of correct responses:

$$ \operatorname{budget}(x) = \operatorname{Percentile}\left(\left\{ |y_i| \mid r(x, y_i) = 1,\ i = 1, \dots, K \right\}, \rho\right). \tag{2} $$

This budget is estimated once at the beginning of training and remains fixed thereafter. Notably, Toggle functions as a stochastic alternating optimization for a bi-objective problem. It is specifically designed to reconcile reasoning capabilities with computational efficiency. We evaluate the effectiveness of Toggle on K2 Thinking [1].
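The Toggle schedule and the budget estimate of Eq. (2) can be sketched as follows. Variable names are ours, and the percentile indexing and the default $\lambda = 0.5$ are simplifying assumptions, not values reported for K2.5:

```python
def estimate_budget(lengths, rewards, rho=80):
    """Eq. (2): rho-th percentile of token lengths among correct rollouts.
    Estimated once at the start of training and then frozen."""
    correct = sorted(l for l, r in zip(lengths, rewards) if r == 1)
    if not correct:                        # no correct rollout: no budget signal
        return max(lengths)
    idx = min(len(correct) - 1, int(len(correct) * rho / 100))
    return correct[idx]

def toggle_rewards(t, m, rewards, lengths, budget, lam=0.5):
    """Eq. (1): in Phase 0 (every other block of m iterations), zero the
    reward of over-budget rollouts, but only once mean accuracy on this
    problem has reached the threshold lam."""
    if (t // m) % 2 == 1:                  # Phase 1: standard scaling
        return list(rewards)
    mean_acc = sum(rewards) / len(rewards)
    return [r if (mean_acc < lam or l <= budget) else 0.0
            for r, l in zip(rewards, lengths)]
```

In Phase 0 the indicator leaves hard problems (mean accuracy below $\lambda$) fully rewarded, while easy problems are rewarded only when the rollout fits the budget, mirroring the conditional application described above.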
As shown in Figure 5, we observe a consistent reduction in output length across nearly all benchmarks. On average, Toggle decreases output tokens by $25\sim 30\%$ with a negligible impact on performance. We also observe that redundant patterns in the chain-of-thought, such as repeated verifications and mechanical calculations, decrease substantially. Furthermore, Toggle shows strong domain generalization. For example, when trained exclusively on mathematics and programming tasks, the model still achieves consistent token reductions on GPQA and MMLU-Pro with only marginal degradation in performance (Figure 5).

Figure 5: Comparison of model performance and token usage (token efficiency before and after Toggle across benchmarks) for Kimi K2 Thinking following token-efficient RL.

# 4.5 Training Infrastructure

Kimi K2.5 inherits the training infrastructure from Kimi K2 [53] with minimal modifications. For multimodal training, we propose the Decoupled Encoder Process, where the vision encoder is incorporated into the existing pipeline with negligible additional overhead.

# 4.5.1 Decoupled Encoder Process (DEP)

In a typical multimodal training paradigm utilizing Pipeline Parallelism (PP), the vision encoder and text embedding are co-located in the first stage of the pipeline (Stage-0). However, due to the inherent variations of multimodal input size (e.g., image counts and resolutions), Stage-0 suffers from drastic fluctuations in both computational load and memory usage. This forces existing solutions to adopt custom PP configurations for vision-language models — for instance, [54] manually adjusts the number of text decoder layers in Stage-0 to reserve memory. While this compromise alleviates memory pressure, it does not fundamentally resolve the load imbalance caused by multimodal input sizes. More critically, it precludes the direct reuse of parallel strategies that have been highly optimized for text-only training.
Leveraging the unique topological position of the visual encoder within the computation graph — specifically, its role as the start of the forward pass and the end of the backward pass — our training uses the Decoupled Encoder Process (DEP), which is composed of three stages in each training step:

- Balanced Vision Forward: We first execute the forward pass for all visual data in the global batch. Because the vision encoder is small, we replicate it on all GPUs regardless of other parallelism strategies. During this phase, the forward computational workload is evenly distributed across all GPUs based on load metrics (e.g., image or patch counts). This eliminates the load imbalance caused by PP and visual token counts. To minimize peak memory usage, we discard all intermediate activations, retaining only the final output activations. The results are gathered back to PP Stage-0.
- Backbone Training: This phase performs the forward and backward passes for the main transformer backbone. Having discarded intermediate activations in the preceding phase, we can now fully leverage any efficient parallel strategies validated in pure-text training. After this phase, gradients are accumulated at the visual encoder output.
- Vision Recomputation & Backward: We re-compute the vision encoder forward pass, followed by a backward pass to compute gradients for the vision encoder parameters.

DEP not only achieves load balance, but also decouples the optimization strategy of the vision encoder from that of the main backbone. K2.5 seamlessly inherits the parallel strategy of K2, achieving a multimodal training efficiency of $90\%$ relative to text-only training. We note that a concurrent work, LongCat-Flash-Omni [55], shares a similar design philosophy.
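The load-balancing step of the balanced vision forward can be approximated with a greedy longest-processing-time heuristic. The report states only that the workload is distributed by load metrics such as image or patch counts, so the specific assignment scheme below is an illustrative assumption:

```python
import heapq

def balance_vision_load(patch_counts, num_gpus):
    """Assign images to GPUs so per-GPU patch totals stay balanced:
    sort images by descending cost, then always give the next image
    to the currently least-loaded GPU (greedy LPT heuristic)."""
    heap = [(0, g) for g in range(num_gpus)]        # (load, gpu_id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]      # image indices per GPU
    for i in sorted(range(len(patch_counts)),
                    key=lambda i: patch_counts[i], reverse=True):
        load, g = heapq.heappop(heap)
        assignment[g].append(i)
        heapq.heappush(heap, (load + patch_counts[i], g))
    return assignment
```

Because only the final output activations survive this phase and are gathered back to PP Stage-0, the balancing decision never interacts with the backbone's pipeline schedule.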
# 5 Evaluations

# 5.1 Main Results

# 5.1.1 Evaluation Settings

Benchmarks We evaluate Kimi K2.5 on a comprehensive benchmark suite spanning text-based reasoning, competitive and agentic coding, multimodal understanding (image and video), autonomous agentic execution, and computer use. Our benchmark taxonomy is organized along the following capability axes:

- Reasoning & General: Humanity's Last Exam (HLE) [46], AIME 2025 [4], HMMT 2025 (Feb) [58], IMO-AnswerBench [37], GPQA-Diamond [47], MMLU-Pro [64], SimpleQA Verified [22], AdvancedIF [23], and LongBench v2 [9].
- Coding: SWE-Bench Verified [29], SWE-Bench Pro (public) [16], SWE-Bench Multilingual [29], Terminal Bench 2.0 [39], PaperBench (CodeDev) [52], CyberGym [66], SciCode [56], OJBench (cpp) [65], and LiveCodeBench (v6) [28].
- Agentic Capabilities: BrowseComp [68], WideSearch [69], DeepSearchQA [60], FinSearchComp (T2&T3) [26], Seal-0 [45], and GDPVal [43].
- Image Understanding: (math & reasoning) MMMU-Pro [76], MMMU (val) [77], CharXiv (RQ) [67], MathVision [61], and MathVista (mini) [36]; (vision knowledge) SimpleVQA [13] and WorldVQA ${}^{2}$; (perception) ZeroBench (w/ and w/o tools) [48], BabyVision [12], BLINK [18], and MMVP [57]; (OCR & document) OCRBench [35], OmniDocBench 1.5 [42], and InfoVQA [38].
- Video Understanding: VideoMMMU [25], MMVU [80], MotionBench [24], Video-MME [17] (with subtitles), LongVideoBench [70], and LVBench [62].
- Computer Use: OSWorld-Verified [72, 73] and WebArena [81].

Baselines We benchmark against state-of-the-art proprietary and open-source models. For proprietary models, we compare against Claude Opus 4.5 (with extended thinking) [6], GPT-5.2 (with xhigh reasoning effort) [41], and Gemini 3 Pro (with high reasoning level) [20]. For open-source models, we include DeepSeek-V3.2 (with thinking mode enabled) [14] for text benchmarks, while vision benchmarks report Qwen3-VL-235B-A22B-Thinking [8] instead.
Evaluation Configurations Unless otherwise specified, all Kimi K2.5 evaluations use temperature $= 1.0$, top-p $= 0.95$, and a context length of 256k tokens. Benchmarks without publicly available scores were re-evaluated under identical conditions and are marked with an asterisk $(^{*})$. The full evaluation settings can be found in Appendix E.

# 5.1.2 Evaluation Results

Comprehensive results comparing Kimi K2.5 against proprietary and open-source baselines are presented in Table 4. We highlight key observations across core capability domains:

Reasoning and General Kimi K2.5 achieves competitive performance with top-tier proprietary models on rigorous STEM benchmarks. On AIME 2025, K2.5 scores $96.1\%$, approaching GPT-5.2's perfect score while outperforming Claude Opus 4.5 $(92.8\%)$ and Gemini 3 Pro $(95.0\%)$. This high-level performance extends to HMMT 2025 $(95.4\%)$ and IMO-AnswerBench $(81.8\%)$, demonstrating K2.5's reasoning depth. Kimi K2.5 also exhibits remarkable knowledge and scientific reasoning capabilities, scoring $36.9\%$ on SimpleQA Verified, $87.1\%$ on MMLU-Pro, and $87.6\%$ on GPQA-Diamond. Notably, on HLE without the use of tools, K2.5 achieves an HLE-Full score of $30.1\%$, with component-wise scores of $31.5\%$ on the text subset and $21.3\%$ on the image subset. When tool use is enabled, K2.5's HLE-Full score rises to $50.2\%$, with $51.8\%$ (text) and $39.8\%$ (image), significantly outperforming Gemini 3 Pro $(45.8\%)$ and GPT-5.2 $(45.5\%)$. In addition to reasoning and knowledge, K2.5 shows strong instruction-following performance $(75.6\%$ on AdvancedIF) and competitive long-context abilities, achieving $61.0\%$ on LongBench v2 against both proprietary and open-source models.

Complex Coding and Software Engineering Kimi K2.5 exhibits strong software engineering capabilities, especially on realistic coding and maintenance tasks.
It achieves $76.8\%$ on SWE-Bench Verified and $73.0\%$ on SWE-Bench Multilingual, outperforming Gemini 3 Pro while remaining competitive with Claude Opus 4.5 and GPT-5.2. On LiveCodeBench v6, Kimi K2.5 reaches $85.0\%$, surpassing DeepSeek-V3.2 $(83.3\%)$ and Claude Opus 4.5 $(82.2\%)$, highlighting its robustness on live, continuously updated coding challenges. On Terminal Bench 2.0, PaperBench, and SciCode, it scores $50.8\%$, $63.5\%$, and $48.7\%$ respectively, demonstrating stable, competition-level performance in automated software engineering and problem solving across diverse domains. In addition, K2.5 attains a score of 41.3 on CyberGym, which tasks models with finding previously discovered vulnerabilities in real open-source software projects given only a high-level description of the weakness, further underscoring its effectiveness in security-oriented software analysis.

Agentic Capabilities Kimi K2.5 establishes new state-of-the-art performance on complex agentic search and browsing tasks. On BrowseComp, K2.5 achieves $60.6\%$ without context management techniques and $74.9\%$ with Discard-all context management [14], substantially outperforming GPT-5.2's reported $65.8\%$, Claude Opus 4.5 $(37.0\%)$, and Gemini 3 Pro $(37.8\%)$. Similarly, WideSearch reaches $72.7\%$ on item-F1. On DeepSearchQA $(77.1\%)$, FinSearchComp (T2&T3) $(67.8\%)$, and Seal-0 $(57.4\%)$, K2.5 leads all evaluated models, demonstrating superior capacity for agentic deep research, information synthesis, and multi-step tool orchestration.

Table 4: Performance comparison of Kimi K2.5 against open-source and proprietary models. Bold denotes the global SOTA; data points marked with * are taken from our internal evaluations. † refers to scores on the text-only subset.
<table><tr><td rowspan="2">Benchmark</td><td rowspan="2">Kimi K2.5</td><td colspan="3">Proprietary</td><td colspan="2">Open Source</td></tr><tr><td>Claude Opus 4.5</td><td>GPT-5.2 (xhigh)</td><td>Gemini 3 Pro</td><td>DeepSeek-V3.2</td><td>Qwen3-VL-235B-A22B</td></tr><tr><td colspan="7">Reasoning & General</td></tr><tr><td>HLE-Full</td><td>30.1</td><td>30.8</td><td>34.5</td><td>37.5</td><td>25.1†</td><td>-</td></tr><tr><td>HLE-Full w/ tools</td><td>50.2</td><td>43.2</td><td>45.5</td><td>45.8</td><td>40.8†</td><td>-</td></tr><tr><td>AIME 2025</td><td>96.1</td><td>92.8</td><td>100</td><td>95.0</td><td>93.1</td><td>-</td></tr><tr><td>HMMT 2025 (Feb)</td><td>95.4</td><td>92.9*</td><td>99.4</td><td>97.3*</td><td>92.5</td><td>-</td></tr><tr><td>IMO-AnswerBench</td><td>81.8</td><td>78.5*</td><td>86.3</td><td>83.1*</td><td>78.3</td><td>-</td></tr><tr><td>GPQA-Diamond</td><td>87.6</td><td>87.0</td><td>92.4</td><td>91.9</td><td>82.4</td><td>-</td></tr><tr><td>MMLU-Pro</td><td>87.1</td><td>89.3*</td><td>86.7*</td><td>90.1</td><td>85.0</td><td>-</td></tr><tr><td>SimpleQA Verified</td><td>36.9</td><td>44.1</td><td>38.9</td><td>72.1</td><td>27.5</td><td>-</td></tr><tr><td>AdvancedIF</td><td>75.6</td><td>63.1</td><td>81.1</td><td>74.7</td><td>58.8</td><td>-</td></tr><tr><td>LongBench v2</td><td>61.0</td><td>64.4*</td><td>54.5*</td><td>68.2*</td><td>59.8*</td><td>-</td></tr><tr><td colspan="7">Coding</td></tr><tr><td>SWE-Bench Verified</td><td>76.8</td><td>80.9</td><td>80.0</td><td>76.2</td><td>73.1</td><td>-</td></tr><tr><td>SWE-Bench Pro (public)</td><td>50.7</td><td>55.4*</td><td>55.6</td><td>-</td><td>-</td><td>-</td></tr><tr><td>SWE-Bench Multilingual</td><td>73.0</td><td>77.5</td><td>72.0</td><td>65.0</td><td>70.2</td><td>-</td></tr><tr><td>Terminal Bench 2.0</td><td>50.8</td><td>59.3</td><td>54.0</td><td>54.2</td><td>46.4</td><td>-</td></tr><tr><td>PaperBench 
(CodeDev)</td><td>63.5</td><td>72.9*</td><td>63.7*</td><td>-</td><td>47.1</td><td>-</td></tr><tr><td>CyberGym</td><td>41.3</td><td>50.6</td><td>-</td><td>39.9*</td><td>17.3*</td><td>-</td></tr><tr><td>SciCode</td><td>48.7</td><td>49.5</td><td>52.1</td><td>56.1</td><td>38.9</td><td>-</td></tr><tr><td>OJBench (cpp)</td><td>57.4</td><td>54.6*</td><td>-</td><td>68.5*</td><td>54.7*</td><td>-</td></tr><tr><td>LiveCodeBench (v6)</td><td>85.0</td><td>82.2*</td><td>-</td><td>87.4*</td><td>83.3</td><td>-</td></tr><tr><td colspan="7">Agentic</td></tr><tr><td>BrowseComp</td><td>60.6</td><td>37.0</td><td>65.8</td><td>37.8</td><td>51.4</td><td>-</td></tr><tr><td>BrowseComp (w/ ctx manage)</td><td>74.9</td><td>57.8</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>BrowseComp (Agent Swarm)</td><td>78.4</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>WideSearch</td><td>72.7</td><td>76.2*</td><td>-</td><td>57.0</td><td>32.5*</td><td>-</td></tr><tr><td>WideSearch (Agent Swarm)</td><td>79.0</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>DeepSearchQA</td><td>77.1</td><td>76.1*</td><td>71.3*</td><td>63.2*</td><td>60.9*</td><td>-</td></tr><tr><td>FinSearchCompT2&T3</td><td>67.8</td><td>66.2*</td><td>-</td><td>49.9</td><td>59.1*</td><td>-</td></tr><tr><td>Seal-0</td><td>57.4</td><td>47.7*</td><td>45.0</td><td>45.5*</td><td>49.5*</td><td>-</td></tr><tr><td>GDPVal-AA</td><td>41.0</td><td>45.0</td><td>48.0</td><td>35.0</td><td>34.0</td><td>-</td></tr><tr><td colspan="7">Image</td></tr><tr><td>MMMU-Pro</td><td>78.5</td><td>74.0</td><td>79.5*</td><td>81.0</td><td>-</td><td>69.3</td></tr><tr><td>MMMU (val)</td><td>84.3</td><td>80.7</td><td>86.7*</td><td>87.5*</td><td>-</td><td>80.6</td></tr><tr><td>CharXiv (RQ)</td><td>77.5</td><td>67.2*</td><td>82.1</td><td>81.4</td><td>-</td><td>66.1</td></tr><tr><td>MathVision</td><td>84.2</td><td>77.1*</td><td>83.0</td><td>86.1*</td><td>-</td><td>74.6</td></tr><tr><td>MathVista 
(mini)</td><td>90.1</td><td>80.2*</td><td>82.8*</td><td>89.8*</td><td>-</td><td>85.8</td></tr><tr><td>SimpleVQA</td><td>71.2</td><td>69.7*</td><td>55.8*</td><td>69.7*</td><td>-</td><td>56.8*</td></tr><tr><td>WorldVQA</td><td>46.3</td><td>36.8</td><td>28.0</td><td>47.4</td><td>-</td><td>23.5</td></tr><tr><td>ZeroBench</td><td>9</td><td>3*</td><td>9*</td><td>8*</td><td>-</td><td>4*</td></tr><tr><td>ZeroBench w/ tools</td><td>11</td><td>9*</td><td>7*</td><td>12*</td><td>-</td><td>3*</td></tr><tr><td>BabyVision</td><td>36.5</td><td>14.2</td><td>34.4</td><td>49.7</td><td>-</td><td>22.2</td></tr><tr><td>BLINK</td><td>78.9</td><td>68.8*</td><td>-</td><td>78.7*</td><td>-</td><td>68.9</td></tr><tr><td>MMVP</td><td>87.0</td><td>80.0*</td><td>83.0*</td><td>90.0*</td><td>-</td><td>84.3</td></tr><tr><td>OmniDocBench 1.5</td><td>88.8</td><td>87.7*</td><td>85.7</td><td>88.5</td><td>-</td><td>82.0*</td></tr><tr><td>OCRBench</td><td>92.3</td><td>86.5*</td><td>80.7*</td><td>90.3*</td><td>-</td><td>87.5</td></tr><tr><td>InfoVQA (test)</td><td>92.6</td><td>76.9*</td><td>84*</td><td>57.2*</td><td>-</td><td>89.5</td></tr><tr><td colspan="7">Video</td></tr><tr><td>VideoMMU</td><td>86.6</td><td>84.4*</td><td>85.9</td><td>87.6</td><td>-</td><td>80.0</td></tr><t