Model-Level Chain-of-Thought Support
While it’s typical to implement CoT at the prompt level, this approach has two main drawbacks:
- Performance: Additional instructions and tokens are needed to guide the CoT process, introducing overhead in both cost and latency.
- Reliability: Ensuring the model follows the correct format is challenging, especially for function calling, which interleaves JSON (function calls) with free text (thinking). This complexity makes streaming extremely difficult. There are tricks to mitigate this, such as adding an extra “explanation” parameter to the function definition (see the sketch below), but they have limitations: by the time the explanation is generated, the model has already decided to trigger functions, and which exact function(s) to trigger, so the improvement in accuracy is limited.
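To make the “explanation parameter” trick concrete, here is an illustrative sketch of a function definition padded with such a field. The tool name, description, and schema below are hypothetical examples, not part of any specific API:

```python
# Illustrative only: a hypothetical weather tool whose schema is padded with an
# "explanation" field so the model emits its reasoning alongside the arguments.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "explanation": {
                    "type": "string",
                    "description": "Why this function is being called and how the arguments were chosen.",
                },
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'."},
            },
            "required": ["explanation", "city"],
        },
    },
}
```

Because the explanation is generated as just another argument, it arrives only after the model has already committed to calling this particular function, which is why the accuracy gain from this workaround is limited.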
How to Use
CoT mode can be toggled at request time with an additional parameter, include_thinking. See the code example below:
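The following is a minimal sketch of toggling CoT per request, assuming an OpenAI-compatible chat completions client. The base URL, model name, and the use of extra_body to pass the flag are assumptions for illustration; only the include_thinking parameter itself comes from the description above.

```python
# Minimal sketch: enabling model-level CoT for a single request via include_thinking.
# The endpoint, model name, and extra_body mechanism are placeholders, not a
# documented API; in practice this would typically be combined with a tools list.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="example-function-calling-model",  # hypothetical model name
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    extra_body={"include_thinking": True},   # request CoT for this call only
)

print(response.choices[0].message)
```

Because the flag is a per-request parameter, callers can enable thinking only for the queries where the extra latency is worth the added reasoning.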