
functools._make_key is slow due to creation of lots of temporary tuples - optimizable with list #143825

@heikkitoivonen

Description

Feature or enhancement

Proposal:

First, I am not sure how important this is, because even though _make_key could be optimized, AFAIK only _lru_cache_wrapper uses it, and that may be overridden by this import:

try:
    from _functools import _lru_cache_wrapper
except ImportError:
    pass
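
One way to tell which implementation is actually in use is a quick check (a sketch; it relies on the pure-Python fallback being a plain function, while the C accelerator is an extension type):

import functools
import types

# Sketch: the pure-Python fallback is a plain function defined in functools.py,
# while the C accelerator exported by _functools is an extension type.
if isinstance(functools._lru_cache_wrapper, types.FunctionType):
    print("pure-Python _lru_cache_wrapper (calls _make_key)")
else:
    print("C-accelerated _lru_cache_wrapper (_make_key is not used)")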

If we do think it is worth optimizing, then this change (I can make a PR if needed):

@@ -537,18 +537,17 @@ def _make_key(args, kwds, typed,
     # Formerly, we sorted() the kwds before looping.  The new way is *much*
     # faster; however, it means that f(x=1, y=2) will now be treated as a
     # distinct call from f(y=2, x=1) which will be cached separately.
-    key = args
+    key = list(args) if kwds or typed else args
     if kwds:
-        key += kwd_mark
-        for item in kwds.items():
-            key += item
+        key.append(kwd_mark)
+        key.extend(kwds.items())
     if typed:
-        key += tuple(type(v) for v in args)
+        key.extend([type(v) for v in args])
         if kwds:
-            key += tuple(type(v) for v in kwds.values())
+            key.extend([type(v) for v in kwds.values()])
     elif len(key) == 1 and type(key[0]) in fasttypes:
         return key[0]
-    return key
+    return tuple(key)
 
 def lru_cache(maxsize=128, typed=False):
     """Least-recently-used cache decorator.

makes _make_key significantly faster. Before the change:

from functools import _make_key as mk

timeit mk((), {}, False)
257 ns ± 0.645 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, False)
861 ns ± 0.597 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, True)
3.03 μs ± 6.85 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

After the change:

timeit mk((), {}, False)
239 ns ± 0.429 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.08x faster

timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, False)
700 ns ± 2.74 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.23x faster

timeit mk((1, 2, 3), {1:2, 2:3, 3:4}, True)
1.25 μs ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
2.42x faster
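
As a quick sanity check (a standalone sketch, not part of the patch; kwd_mark and fasttypes stand in for the keyword defaults in functools._make_key), the list-based variant still distinguishes the call shapes the key is meant to separate:

kwd_mark = (object(),)   # stand-in for the sentinel default in functools._make_key
fasttypes = {int, str}

def make_key(args, kwds, typed):
    # Body of the patched _make_key from the diff above.
    key = list(args) if kwds or typed else args
    if kwds:
        # Note: this appends the sentinel tuple itself, where the old code
        # appended its single element; either way the sentinel keeps the
        # positional and keyword parts of the key from colliding.
        key.append(kwd_mark)
        key.extend(kwds.items())
    if typed:
        key.extend([type(v) for v in args])
        if kwds:
            key.extend([type(v) for v in kwds.values()])
    elif len(key) == 1 and type(key[0]) in fasttypes:
        return key[0]
    return tuple(key)

assert make_key((1, 2), {}, False) != make_key((1,), {"y": 2}, False)
assert make_key((1,), {}, True) != make_key((1.0,), {}, True)   # typed separates int/float
assert make_key(("a",), {}, False) == "a"                       # fasttypes short-circuit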

If we remove the import so that the pure-Python wrapper is exercised, a benchmark of lru_cache (the public API for this code) shows a 1.01x to 1.68x improvement depending on the use case:

lru_cache_baseline.json
=======================

Python version: 3.15.0a3+ (64-bit) revision e0fb2780649
Report on macOS-14.6.1-arm64-arm-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2026-01-13 20:58:56.174556
End date: 2026-01-13 21:02:28.432336

lru_cache_optimized.json
========================

Python version: 3.15.0a3+ (64-bit) revision e0fb2780649
Report on macOS-14.6.1-arm64-arm-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2026-01-13 21:04:47.595674
End date: 2026-01-13 21:08:21.508782

### lru_cache_hit_args_and_kwargs ###
Mean +- std dev: 104 us +- 1 us -> 103 us +- 1 us: 1.01x faster
Not significant

### lru_cache_hit_args_and_multiple_kwargs ###
Mean +- std dev: 126 us +- 1 us -> 113 us +- 1 us: 1.12x faster
Significant (t=65.57)

### lru_cache_hit_multiple_args ###
Mean +- std dev: 68.5 us +- 0.9 us -> 66.1 us +- 0.6 us: 1.04x faster
Significant (t=17.21)

### lru_cache_hit_multiple_kwargs ###
Mean +- std dev: 121 us +- 1 us -> 106 us +- 1 us: 1.14x faster
Significant (t=91.32)

### lru_cache_hit_single_float ###
Mean +- std dev: 71.7 us +- 0.6 us -> 69.5 us +- 1.1 us: 1.03x faster
Significant (t=13.89)

### lru_cache_hit_single_int ###
Mean +- std dev: 69.0 us +- 0.7 us -> 65.4 us +- 0.7 us: 1.05x faster
Significant (t=27.70)

### lru_cache_hit_single_kwarg ###
Mean +- std dev: 98.3 us +- 2.7 us -> 95.1 us +- 0.9 us: 1.03x faster
Significant (t=8.68)

### lru_cache_hit_single_str ###
Mean +- std dev: 69.8 us +- 0.7 us -> 66.7 us +- 0.5 us: 1.05x faster
Significant (t=27.05)

### lru_cache_hit_typed_args_and_kwargs ###
Mean +- std dev: 247 us +- 2 us -> 147 us +- 1 us: 1.68x faster
Significant (t=367.54)

### lru_cache_hit_typed_multiple_args ###
Mean +- std dev: 143 us +- 2 us -> 103 us +- 1 us: 1.38x faster
Significant (t=158.01)

### lru_cache_hit_typed_single_arg ###
Mean +- std dev: 123 us +- 1 us -> 92 us +- 0 us: 1.34x faster
Significant (t=212.03)

### lru_cache_hit_typed_single_kwarg ###
Mean +- std dev: 213 us +- 2 us -> 128 us +- 2 us: 1.66x faster
Significant (t=233.02)

### lru_cache_miss_single_int ###
Mean +- std dev: 68.5 us +- 0.8 us -> 65.0 us +- 0.5 us: 1.05x faster
Significant (t=29.21)

### lru_cache_miss_single_kwarg ###
Mean +- std dev: 96.6 us +- 0.7 us -> 94.4 us +- 0.5 us: 1.02x faster
Significant (t=19.86)

### lru_cache_miss_typed_args_kwargs ###
Mean +- std dev: 237 us +- 3 us -> 143 us +- 1 us: 1.65x faster
Significant (t=260.03)

The benchmark was written by Claude Code, and since it is 233 lines long I am attaching it rather than inlining it.

bm_lru_cache.py
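
The comparison tables above are in the style of pyperf's compare_to output (e.g. python -m pyperf compare_to lru_cache_baseline.json lru_cache_optimized.json). For a flavor of what one such micro-benchmark looks like, here is a minimal pyperf sketch; it is illustrative only (the attached file is the authoritative benchmark), and it only exercises _make_key on builds where the _functools import above is disabled:

import pyperf
from functools import lru_cache

@lru_cache(maxsize=128, typed=True)
def f(a, b=0):
    return a + b

def bench_hit(loops):
    # Repeated identical calls: every call after the first is a cache hit,
    # so the time is dominated by key construction and the cache lookup.
    t0 = pyperf.perf_counter()
    for _ in range(loops):
        f(1, b=2)
    return pyperf.perf_counter() - t0

runner = pyperf.Runner()
runner.bench_time_func('lru_cache_hit_typed_args_and_kwargs', bench_hit)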

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response
