Learning When to Reason: Gating LLM Inference for Cost-Efficient Serverless Function Scheduling at Scale

Authors

  • Megan Sullivan, The University of Michigan, Ann Arbor, MI 48109, USA
  • Boyang He, The University of Michigan, Ann Arbor, MI 48109, USA
  • Patrick Evans, The University of Michigan, Ann Arbor, MI 48109, USA

DOI:

https://doi.org/10.54097/gwmv0761

Keywords:

Large language model inference, serverless computing, function scheduling, gating mechanism, cost efficiency, adaptive computation, cloud resource management

Abstract

Serverless computing platforms face a fundamental tension when integrating scheduling agents based on large language models (LLMs): invoking full chain-of-thought reasoning for every function placement decision is computationally prohibitive, yet lightweight heuristics alone cannot capture complex workload semantics. This paper presents GateLLM, a learned gating framework that routes each incoming serverless scheduling request to either a lightweight fast-path handler or a deliberative LLM reasoning pipeline, based on a compact feature vector extracted from request metadata. The gating classifier is a multilayer perceptron trained with a combination of offline oracle labeling and online reinforcement feedback derived from observed scheduling outcomes, which lets it adapt continuously to shifts in the workload distribution. Evaluated on a large-scale production function trace, GateLLM reduces total LLM inference cost by 61.3% relative to a full-reasoning baseline while incurring only a 2.1% degradation in average job completion time and a 0.25 percentage point increase in the service level objective (SLO) violation rate. The analysis shows that over 68% of real-world scheduling events are structurally simple and can be resolved without LLM involvement, establishing inference gating as an essential primitive for operationally viable intelligent schedulers in production cloud environments.
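
The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, the single hidden layer, the weights, and the 0.5 threshold are all hypothetical, chosen only to show how a small multilayer perceptron could map a request feature vector to a fast-path or LLM decision.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GateMLP:
    """One-hidden-layer perceptron estimating P(request needs LLM reasoning).

    All weights are hypothetical placeholders; a real gate would be trained
    (the abstract mentions offline oracle labels plus online feedback).
    """
    w1: List[List[float]]  # shape: hidden_dim x input_dim
    b1: List[float]        # shape: hidden_dim
    w2: List[float]        # shape: hidden_dim
    b2: float

    def probability(self, x: Sequence[float]) -> float:
        # Hidden layer with ReLU activation.
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        # Output layer with a sigmoid, giving a value strictly in (0, 1).
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


def route(gate: GateMLP, features: Sequence[float], threshold: float = 0.5) -> str:
    """Return "llm" for the deliberative path, "fast" for the heuristic path."""
    return "llm" if gate.probability(features) >= threshold else "fast"


# Hypothetical features, each scaled to [0, 1]:
#   [queue_depth, cold_start_flag, dependency_count]
demo_gate = GateMLP(
    w1=[[2.0, 1.0, 3.0], [-1.0, 0.5, 0.0]],
    b1=[-1.0, 0.0],
    w2=[2.0, -1.0],
    b2=-1.0,
)

simple_request = [0.1, 0.0, 0.0]   # shallow queue, warm start, no dependencies
complex_request = [0.9, 1.0, 0.8]  # deep queue, cold start, many dependencies

if __name__ == "__main__":
    print(route(demo_gate, simple_request))
    print(route(demo_gate, complex_request))
```

In a sketch like this, the threshold is the knob that trades inference cost against scheduling quality: raising it sends more requests down the fast path, lowering it sends more to the LLM pipeline.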

Published

09-06-2026

Section

Articles

How to Cite

Sullivan, M., He, B., & Evans, P. (2026). Learning When to Reason: Gating LLM Inference for Cost-Efficient Serverless Function Scheduling at Scale. Academic Journal of Applied Sciences, 2(1), 39-45. https://doi.org/10.54097/gwmv0761