Academese in Claude Sonnet’s CoT

While interacting with Claude Sonnet with extended thinking enabled, particularly on challenging prompts that require significant token usage, I observed the following pattern in its chain-of-thought (CoT):

  1. Early on, its reasoning seems concrete and human-like, making specific observations, listing actionable steps, and executing them;
  2. Later, the model’s CoT becomes almost meta, describing what it’s doing. It feels performative.

To illustrate, here is an example of the former:

“We need to find an asymptotic formula for the integral $I(x) = \int_{-0.8}^{0.6} 2.0 e^{x(-0.6t^4 + 2.3\sin(t))} dt$ as $x \to \infty$.

When $x$ is large and positive, the exponential $e^{x(-0.6t^4 + 2.3\sin(t))}$ will be dominated by the behavior where the exponent $-0.6t^4 + 2.3\sin(t)$ is maximized.

Let me define $f(t) = -0.6t^4 + 2.3\sin(t)$.

For large $x$, the integral will be dominated by the region where $f(t)$ achieves its maximum value in the interval $[-0.8, 0.6]$.”

And of the latter:

“Analyzing the derivative, I see $f’(t)$ changes from $2.3 - 1.15t^2 - 2.4t^3$ being positive near zero to potentially negative as the cubic term becomes dominant with increasing $t$. I’ll evaluate the endpoints and seek a numerical approximation of the maximum.

Calculating values at $-0.8$, $0.6$, and $0$ reveals interesting behavior. At $-0.8$, $f(t)$ is approximately $-1.895$, at $0.6$ it reaches $1.222$, and at $0$ it equals zero. This suggests the maximum likely exists in the positive region, possibly near $0.6$ or at a critical point.

I’ll systematically investigate solving $2.4t^3 = 2.3\cos(t)$ for $t > 0$, seeking where the derivative becomes zero.”
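Incidentally, the numbers in this excerpt check out. A few lines of Python (standard library only; this check is mine, not part of the model's output) reproduce the quoted values and locate the critical point:

```python
from math import sin, cos

def f(t):
    """The exponent from the integral above."""
    return -0.6 * t**4 + 2.3 * sin(t)

def fprime(t):
    """f'(t) = 2.3 cos(t) - 2.4 t^3, which is strictly decreasing for t > 0."""
    return 2.3 * cos(t) - 2.4 * t**3

# Endpoint and interior values quoted in the CoT:
print(round(f(-0.8), 3))  # -1.896
print(round(f(0.6), 3))   # 1.221
print(round(f(0.0), 3))   # 0.0

# f'(t) = 0 iff 2.4 t^3 = 2.3 cos(t); since f' is decreasing on [0, 1.5],
# bisection finds the unique positive root.
lo, hi = 0.0, 1.5
for _ in range(60):
    mid = (lo + hi) / 2
    if fprime(mid) > 0:
        lo = mid
    else:
        hi = mid
print(round(lo, 3))  # 0.856
```

The root of $2.4t^3 = 2.3\cos(t)$ sits near $t \approx 0.856$, outside the interval $[-0.8, 0.6]$, so the maximum is attained at the right endpoint $t = 0.6$, consistent with the reasoning in both excerpts.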

What’s going on here? A few things come to mind.

First, it could be that the model does not quite know how to solve the problem, and hence takes refuge in fancy language to pretend it’s doing something. I don’t think this is the case: in the example above, even though the language becomes markedly different, the reasoning makes sense, and the model makes progress in solving the problem.

Second, it could be that the model continues to perform useful computation, but the CoT is no longer an accurate representation of what it is internally doing. In that sense, the CoT could be a form of post-hoc rationalization for internal computation that, while valuable, cannot or will not be expressed in human language.

Third, perhaps the model is doing some sort of cargo-cult problem-solving: academics are often smart and solve hard problems. Academics often speak in complex jargon and write in hard-to-decipher language (sometimes known as ‘academese’). Hence, hypothetically reasons the model, if I employ jargon and write in complex language, I too will be smart and solve this hard problem. Note that this could be the case while it is also true that the model is doing something useful: perhaps activating the part of its brain that speaks academese helps it reason, in line with results indicating that priming the model to pay attention to certain concepts improves performance on related tasks.

To make this more concrete, let’s state two hypotheses:

  1. Academese in the CoT represents a faithfulness drop. The model still reasons, but this is not accurately represented in the CoT.
  2. Academese in the CoT represents actual thinking degradation. The model no longer reasons well, and this is reflected in the weird CoT.

The first hypothesis indicates a lack of faithfulness, with clear implications for CoT monitoring. The second indicates that extended reasoning has diminishing returns, or even degrades. The latter has been extensively observed; the former, to the best of my knowledge, has not.

Can we come up with an experiment to test this? I think so! The idea is to compare answer quality on math problems with varying amounts of thinking time. Suppose we stop Claude before the academese phase (budget t1), or let it run through the academese (budget t2): which performs better? If hypothesis 1 is correct (faithfulness drop), then t2 should significantly outperform t1 - the model is still reasoning productively, we just can’t see it clearly in the CoT. If hypothesis 2 is correct (actual degradation), then t1 and t2 should perform similarly, or t2 might even be worse - the academese reflects a genuine decline in reasoning quality.
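As a sketch of what one trial of this comparison could look like - assuming the `anthropic` Python SDK; the model id, prompt handling, and grading heuristic here are my assumptions, not a description of the actual harness:

```python
import re

def grade(response_text: str, answer: int) -> bool:
    """AIME answers are integers in 0..999; grade on the last integer in the reply.
    This is a crude heuristic, chosen only because AIME answers are easy to check."""
    matches = re.findall(r"\d+", response_text)
    return bool(matches) and int(matches[-1]) == answer

def run_trial(problem: str, budget: int, max_tokens: int = 16000) -> str:
    """One attempt at the problem with a given thinking budget (t1 or t2)."""
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=max_tokens,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": problem}],
    )
    # Keep only the visible text blocks; thinking blocks are returned separately.
    return "".join(b.text for b in response.content if b.type == "text")
```

With something like this in place, the experiment reduces to comparing the success rate of `grade(run_trial(problem, t1), answer)` against that of `grade(run_trial(problem, t2), answer)` over repeated attempts.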

Naturally, answer quality is only a proxy for ‘quality of thinking’. This is unavoidable in this type of empirical study - one treats the model as a black box and works with proxies. Doing otherwise would require working with its internals.

To implement this approach, I needed problems with easy-to-grade answers of the right difficulty. If they’re too easy, the model solves them without much thinking, and academese never kicks in. If they’re too hard, the model never solves them, and one cannot distinguish answer quality. After some trial and error, I found AIME-level problems to sit in the desired middle ground, and hence used the 2024 mock AIME dataset (see https://web.evanchen.cc/mockaime.html and https://huggingface.co/datasets/EpochAI/otis-mock-aime-24-25; many thanks to the authors!).

Note on thinking budgets: the Anthropic API allows setting both a total token budget (`max_tokens`, a hard cap on how many tokens the model may produce, including CoT and visible response), and a thinking budget (`budget_tokens`), which, according to the API docs, “Determines how many tokens Claude can use for its internal reasoning process. Larger budgets can enable more thorough analysis for complex problems, improving response quality.” I read this as a hard limit. However, that is apparently not the case! It’s more of a guideline, and Claude may freely decide how many tokens to use for its thinking - either more or fewer than the budget. According to Anthropic’s API chatbot, this is by design. This is unfortunate, because it means a clean experimental design along the lines of what I had in mind is not possible: fine-grained control over Claude’s thinking length is not available.
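For reference, this is roughly how the two parameters sit in a request (a sketch; the model id and the specific numbers are placeholders, not the experiment's exact values):

```python
# max_tokens caps everything the model emits (thinking + visible answer);
# thinking.budget_tokens is the *target* for reasoning tokens, which - per
# the behavior described above - the model treats as a guideline, not a cap.
request = {
    "model": "claude-sonnet-4-20250514",  # assumed model id
    "max_tokens": 8192,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "messages": [{"role": "user", "content": "problem statement goes here"}],
}
```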

I settled for something less: vary the maximum allowed tokens (with the thinking budget set to a value very close to it) and observe. The actual number of tokens Claude uses to think does bear some relation to the thinking budget (i.e., the parameter is not completely useless), so this provides a noisy version of the experiment I originally wanted to run. More concretely, I ran 30 trials on a single mock AIME problem across three budget levels - 2000, 4000, and 8000 tokens - with 10 attempts at each level.

The results:

| Budget (tokens) | Average tokens used | Success rate |
|---|---|---|
| 2000 | 1633 | 20.0% |
| 4000 | 1913 | 0.0% |
| 8000 | 2968 | 50.0% |

Some observations:

  1. There is in fact some correlation between the budget and the number of tokens Claude uses;
  2. The success rate does not increase monotonically with the budget. This might very well be a statistical artifact, as I only performed 10 runs per level.
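The second point can be checked directly. With 2/10 successes at 2000 tokens and 0/10 at 4000, an exact two-sided Fisher test (a stdlib-only check I added here; it was not part of the original run) shows the dip is far from significant:

```python
from math import comb

def fisher_two_sided(k1, n1, k2, n2):
    """Exact two-sided Fisher test: total probability, under the hypergeometric
    distribution with fixed margins, of tables at least as extreme as observed."""
    K, N = k1 + k2, n1 + n2          # total successes, total trials
    denom = comb(N, K)
    def p(a):                        # P(group 1 has a successes | margins)
        return comb(n1, a) * comb(n2, K - a) / denom
    p_obs = p(k1)
    lo, hi = max(0, K - n2), min(K, n1)
    return sum(p(a) for a in range(lo, hi + 1) if p(a) <= p_obs + 1e-12)

print(round(fisher_two_sided(2, 10, 0, 10), 3))  # 2000 vs 4000: 0.474
print(round(fisher_two_sided(2, 10, 5, 10), 3))  # 2000 vs 8000: 0.35
```

The drop from 2/10 to 0/10 has p ≈ 0.47, and even the spread between the 2000 and 8000 budgets has p ≈ 0.35, so at 10 runs per level the non-monotonicity is indeed consistent with noise.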

This gave me a metric for response quality at varying token budgets. The next step was to go looking for academese in the CoTs. In doing so, I noticed something strange: the number of tokens I counted in the response plus thinking did not match the number of output tokens reported in the API’s usage field. In fact, it was off by a lot! Sometimes by a factor of two.

Digging into Anthropic’s documentation on extended thinking (https://platform.claude.com/docs/en/build-with-claude/extended-thinking#summarized-thinking), I found that starting with models in the Claude 4 family, “the Messages API for Claude 4 models returns a summary of Claude’s full thinking process”, that “The billed output token count will not match the count of tokens you see in the response.” and that “The first few lines of thinking output are more verbose, providing detailed reasoning”. This exactly matches what I observed: the first few lines of thinking provide detailed reasoning, and then there’s a bit of a ‘phase change’, to what I described as academese. Mystery solved. I was not looking at the raw CoT, but at a summary thereof.

To further confirm this, I ran a similar experiment with Claude Sonnet 3.7, for which the API does provide the raw CoT. As expected, I observed no academese.

This was disappointing and, to be honest, slightly embarrassing! I should have read the docs before spending a few hours (and five bucks worth of tokens) running experiments.

Code at https://github.com/FranciscoHS/cot-academese