Estimating AI Energy Use

Estimating AI Power Savings

Green Robot can drastically reduce AI energy use for typical workloads. That’s possible for two reasons:

Big AI models use a lot more power than small ones — even when a small one would do. A small model can use a fraction of the energy of a frontier model for the same answer, when the question is simple enough that the small model is up to it. Today’s smaller AI models are very capable, about a year behind the state of the art, but can run at greatly reduced power.
“Reasoning” mode burns through a huge amount of extra power. When a modern AI model “thinks step by step,” it can generate many times more text behind the scenes than the answer shown. That has a direct, measured energy cost. For easy questions, that thinking is unnecessary.

Green Robot classifies each incoming query and sends it to the most efficient AI that can produce a high-quality answer. If a query doesn’t need deep reasoning, we disable it. For harder queries we still use the frontier models, to get the highest quality results. The savings come from not using a 747 to deliver a pizza.

Research overview

Content-aware AI routing systems like Green Robot have been studied for some time.

RouteLLM is a peer-reviewed framework published at ICLR 2025 by researchers at UC Berkeley, Anyscale, and Canva. They built a router that decides, for each query, whether to send it to a strong frontier model or a weaker, cheaper one. On a standard benchmark for chat-style queries, their router achieved 95% of GPT-4’s quality while reducing cost by 85%. ¹

An earlier system called FrugalGPT showed a more dramatic version of the same idea: matching GPT-4’s performance with up to 98% lower cost by intelligently chaining smaller models. ²

A separate 2025 report from UNESCO and University College London ran controlled experiments on open-source models and found that using small task-specific models can cut energy use by up to 90% without losing accuracy, compared to using a single large general-purpose model for everything. ³

Reasoning is by far the biggest energy lever

In December 2025, researchers at Hugging Face published AI Energy Score v2, a standardized benchmark of energy use across modern language models. Their headline finding:

Reasoning-mode AI uses about 30 times more energy on average than the same kind of model without reasoning enabled.
For specific models, turning reasoning on increased energy use by 150 to nearly 700 times for the same query.
This is mostly because reasoning models generate 300 to 800 times more text behind the scenes than they show in their answer. ⁴

That’s the largest efficiency lever we have available. By turning reasoning off for queries that don’t need it, and dialing it down for queries that need only a little, Green Robot captures most of that opportunity. Because reasoning is increasingly turned on by default in frontier products, even moderate use of this lever delivers outsized savings.

How big the underlying gap is

Two further studies establish the size of the model-to-model gap that routing exploits:

TokenPowerBench (accepted at AAAI 2026) measured energy use across language models from 1 billion to 405 billion parameters on production GPUs. Energy per generated token rises sharply with model size, and inference accounts for more than 90% of an AI model’s lifetime power consumption. ⁵
A January 2026 study from researchers at the University of Michigan, “Where Do the Joules Go?”, ran nearly 2,000 measurements across 46 AI models. They found that the type of task alone can cause up to 25 times difference in energy use, even on the same model. ⁶

What real AI queries cost today

For context on absolute numbers:

Google disclosed in August 2025 that the median text query to its Gemini service uses 0.24 watt-hours of energy — roughly the energy to watch nine seconds of TV. Google’s number includes data center cooling, idle capacity, and other real-world overhead. ⁷
Epoch AI independently estimated a typical ChatGPT query at around 0.3 watt-hours. ⁸

Small per-query numbers, but they add up across billions of queries per day. Most of that volume is going through models and reasoning settings sized for the hardest queries — not the average one.

Ways to control AI power use

Lever	What the research shows	Source
Routing simple queries to smaller models	Up to 85% cost reduction at 95% quality on chat benchmarks	RouteLLM ¹
Using task-appropriate models instead of one big model for everything	Up to 90% energy reduction	UNESCO/UCL ³
Disabling reasoning when not needed	~30× energy reduction on average per affected query	AI Energy Score v2 ⁴
Combined techniques applied together	Up to 90% combined	UNESCO/UCL ³

Real workloads are a mix. Some queries genuinely need a frontier model with full reasoning; nothing we do helps with those. Others are simple lookups where a small model with no reasoning is plenty. Most fall somewhere in between.

In practice, savings depend on workload composition. Workloads dominated by simple, high-volume queries — FAQ chatbots, classification, short-text summarization, routine extraction — push toward the upper end. Workloads dominated by genuinely hard, multi-step problems push toward the lower end, because we route those to a frontier model with reasoning.

References

Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., & Stoica, I. (2025). RouteLLM: Learning to Route LLMs with Preference Data. International Conference on Learning Representations (ICLR) 2025. arxiv.org/abs/2406.18665 ↩
Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arxiv.org/abs/2305.05176 ↩
UNESCO and University College London (July 2025). Smarter, Smaller, Stronger: Resource-Efficient AI and the Future of Digital Transformation. unesco.org/en/articles/ai-large-language-models-new-report-shows-small-changes-can-reduce-energy-use-90 ↩
Luccioni, S., & Gamazaychikov, B. (December 2025). AI Energy Score v2: Refreshed Leaderboard, now with Reasoning. Hugging Face. huggingface.co/blog/sasha/ai-energy-score-v2 ↩
Niu, C., Zhang, W., Li, J., Zhao, Y., Wang, T., Wang, X., & Chen, Y. (December 2025). TokenPowerBench: Benchmarking the Power Consumption of LLM Inference. Accepted at AAAI 2026. arxiv.org/abs/2512.03024 ↩
Chung, J.-W., Wu, R., Ma, J. J., & Chowdhury, M. (January 2026). Where Do the Joules Go? Diagnosing Inference Energy Consumption. arxiv.org/abs/2601.22076 ↩
Vahdat, A., & Dean, J. (August 2025). Measuring the environmental impact of AI inference. Google Cloud Blog. cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference ↩
You, J. (February 2025). How much energy does ChatGPT use? Epoch AI. epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use ↩