Coverage for src/pycse/hashcache_v1.py: 0.00%
63 statements
coverage.py v7.11.0, created at 2025-10-23 16:23 -0400
"""hashcache - a decorator for persistent, file/hash-based caching.

I found some features of joblib unsuitable for how I want to use a cache.

1. joblib saves results keyed on the "file" Python thinks the function is
defined in, which leads to repeated runs if you run the same code from a
script, a notebook, or stdin. It also means the cache is not portable to other
machines, and maybe not even across time, since temp directories and kernel
parameters are involved. I could not figure out how to change this in joblib.

2. joblib uses the function source code in the hash, so inconsequential
changes like whitespace, docstrings, and comments change the hash.

This library aims to provide a simpler version of what I wish joblib did for me.

Results are cached based on a hash of the function name, bytecode, and the
arg and kwarg values (standardized to include defaults). I use joblib.hash
for this. Two functions with the same name and bytecode will cache to the
same result, even if the rest of their source differs.

The cache location is set as a function attribute:

    hashcache.cache = './cache'

This is alpha, proof-of-concept code. Test it a lot for your use case. The API
is not stable and is subject to change.

Some things to do:

1. The function attributes are kind of weird; maybe these should be decorator
arguments.

Pros:

1. File-based cache, which means many functions can run in parallel, reading
and writing; you are limited only by file I/O speed and disk space.

2. Semi-portability. The cache can be synced across machines, and caches can
be merged with little risk of conflict.

3. No server is required. Everything is done at the OS level.

4. Extendability. You can define your own functions for loading and dumping
data.

Cons:

1. Hashes are fragile and not robust. They are fragile with respect to any
change in how bytecode is generated, or via mutable arguments, etc. They are
not robust to system-level changes like library versions or global variables.
The main advantage of hashes is that you can compute them.

2. File-based cache, which means that if you generate thousands of files, it
can be slow to delete them. Although access to a single result is fast, since
you reach it directly by path, it will not be fast to iterate over all the
results, e.g. if you want to implement some kind of search or reporting.

3. No server. You have to roll your own update strategy if you run things on
multiple machines that should all cache to a common location.

Changelog
---------

[2023-09-23 Sat] Changed the hash signature (breaking change). It is too
difficult to capture global state, and the use of internal variable names is
not consistent with using the bytecode to be insensitive to unimportant
variable name changes.

Pulled out functions for loading and dumping data. This is a precursor to
enabling other backends, like lmdb or sqlite, instead of files; you can then
simply provide new functions for these steps.

"""
import functools
import inspect
import os
import pprint
import time
from pathlib import Path

import joblib

def get_standardized_args(func, args, kwargs):
    """Return a standardized dictionary of arguments for FUNC(*ARGS, **KWARGS).

    The dictionary includes default values, even if they were not passed in
    the call.
    """
    sig = inspect.signature(func)
    standardized_args = sig.bind(*args, **kwargs)
    standardized_args.apply_defaults()
    return standardized_args.arguments
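
# A standalone illustration of the standardization above: binding the call
# and applying defaults yields one canonical mapping no matter how the call
# was spelled. The function `demo` here is hypothetical.
def _standardized_args_demo():
    import inspect

    def demo(a, b=2, *, c=3):
        return a + b + c

    sig = inspect.signature(demo)
    bound = sig.bind(1, c=5)  # called as demo(1, c=5); b was not passed
    bound.apply_defaults()  # fills in b=2
    return dict(bound.arguments)  # {'a': 1, 'b': 2, 'c': 5}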

def get_hash(func, args, kwargs):
    """Get a hash for running FUNC(*ARGS, **KWARGS).

    This is the most critical feature of hashcache, as it provides the key
    used to store and look up results later. Think carefully before changing
    this function; doing so breaks past caches.

    FUNC should be as pure as reasonable. This hash is insensitive to global
    variables.

    The hash is computed from the function name, its bytecode, and the
    standardized kwargs, including defaults. We use bytecode because it is
    insensitive to things like whitespace, comments, docstrings, and variable
    name changes that don't affect results. It is assumed that two functions
    with the same name and bytecode will evaluate to the same result.
    """
    return joblib.hash(
        [
            func.__code__.co_name,  # the function name
            func.__code__.co_code,  # the function bytecode
            get_standardized_args(func, args, kwargs),  # args used, including defaults
        ],
        hash_name="sha1",
    )
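
# The bytecode claim above can be checked directly: renaming local variables
# (or editing comments) leaves co_code unchanged, because bytecode refers to
# locals by index, not by name. A standalone sketch, using stdlib hashlib in
# place of joblib.hash:
def _bytecode_insensitivity_demo():
    import hashlib

    def f(x):
        y = x + 1  # a comment; comments never reach the bytecode
        return y

    def g(a):
        b = a + 1
        return b

    same_code = f.__code__.co_code == g.__code__.co_code
    same_hash = (
        hashlib.sha1(f.__code__.co_code).hexdigest()
        == hashlib.sha1(g.__code__.co_code).hexdigest()
    )
    return same_code, same_hash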

def get_hashpath(hsh):
    """Return the path to the cache file for HSH."""
    cache = Path(hashcache.cache)
    hshdir = cache / hsh[0:2]
    hshpath = hshdir / hsh
    return hshpath
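
# The first two hash characters shard the cache into subdirectories, so no
# single directory accumulates every entry. A standalone sketch of the layout
# (the hash value here is made up):
def _hashpath_demo():
    from pathlib import Path

    hsh = "ab12cd34ef56"  # hypothetical hash string
    path = Path("cache") / hsh[0:2] / hsh
    return path.parts  # ('cache', 'ab', 'ab12cd34ef56')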

def load_data(hsh, verbose=False):
    """Load the data for HSH.

    HSH is a string for the hash associated with the data you want.

    Returns (success, data). If the load succeeds, success will be True. If
    the data does not exist yet, success will be False and data will be None.
    """
    hshpath = get_hashpath(hsh)
    if os.path.exists(hshpath):
        data = joblib.load(hshpath)
        if verbose:
            pp = pprint.PrettyPrinter(indent=4)
            pp.pprint(data)
        return True, data["output"]
    else:
        return False, None

def dump_data(hsh, data, verbose):
    """Dump DATA into the file for HSH."""
    hshpath = get_hashpath(hsh)
    os.makedirs(hshpath.parent, exist_ok=True)

    files = joblib.dump(data, hshpath)

    if verbose:
        pp = pprint.PrettyPrinter(indent=4)
        print(f"wrote {hshpath}")
        pp.pprint(data)

    return files
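
# The load/dump pair above can be exercised in isolation. This sketch swaps
# joblib for the stdlib pickle module (an assumption, to keep the example
# dependency-free) but keeps the same (success, data) contract and the same
# {"output": ...} payload shape:
def _load_dump_demo():
    import os
    import pickle
    import tempfile

    with tempfile.TemporaryDirectory() as cache:
        path = os.path.join(cache, "ab", "ab12cd34")  # hypothetical hash path

        def load(p):
            if os.path.exists(p):
                with open(p, "rb") as f:
                    return True, pickle.load(f)["output"]
            return False, None

        missing = load(path)  # nothing dumped yet -> (False, None)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump({"output": 42}, f)
        return missing, load(path)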

def hashcache(fn=None, *, verbose=False, loader=load_data, dumper=dump_data):
    """Cache results by a hash of the function, its arguments, and kwargs.

    Set hashcache.cache to the directory you want the cache saved in.
    Default = "cache".
    """

    def wrapper(func, *args, **kwargs):
        hsh = get_hash(func, args, kwargs)

        # Try getting the data first.
        success, data = loader(hsh, verbose)

        if success:
            return data

        # We did not succeed, so we run the function and cache the result.
        # We store some metadata for future analysis.
        t0 = time.time()
        value = func(*args, **kwargs)
        tf = time.time()

        # Functions with mutable arguments can change the arguments, which
        # is a problem here. We just warn the user. Nothing else makes
        # sense; the mutability may be intentional.
        if hsh != get_hash(func, args, kwargs):
            print("WARNING something mutated, future calls will not use the cache.")

        # Try a couple of ways to get a username.
        try:
            user = os.getlogin()
        except OSError:
            user = os.environ.get("USER")

        data = {
            "output": value,
            "hash": hsh,
            "func": func.__code__.co_name,  # the function name
            "module": func.__module__,
            "args": args,
            "kwargs": kwargs,
            "standardized-kwargs": get_standardized_args(func, args, kwargs),
            "version": hashcache.version,
            "cwd": os.getcwd(),  # Is this a good idea? Could it leak
            # sensitive information from the path?
            # Should we include other info like
            # hostname?
            "user": user,
            "run-at": t0,
            "run-at-human": time.asctime(time.localtime(t0)),
            "elapsed_time": tf - t0,
        }

        dumper(hsh, data, verbose)
        return value

    # This silliness is because I want the decorator to work with and
    # without arguments:
    #
    # @hashcache
    # def f(...)
    #
    # and
    #
    # @hashcache(verbose=True)
    # def f(...)
    #
    # Yes, it feels gross.
    if fn is not None:
        return functools.partial(wrapper, fn)
    else:

        def decorator(func):
            newrapper = functools.partial(wrapper, func)
            return functools.update_wrapper(newrapper, func)

        return decorator

hashcache.cache = "cache"
hashcache.version = "0.0.3"
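
# The with-and-without-arguments pattern at the end of hashcache can be
# sketched in isolation. This standalone toy decorator (all names here are
# hypothetical) shows why the fn=None branch makes both spellings work:
def _dual_mode_decorator_demo():
    import functools

    def deco(fn=None, *, tag="default"):
        def wrapper(func, *args, **kwargs):
            return (tag, func(*args, **kwargs))

        if fn is not None:  # used as @deco, with no parentheses
            return functools.update_wrapper(functools.partial(wrapper, fn), fn)

        def decorator(func):  # used as @deco(tag=...)
            return functools.update_wrapper(functools.partial(wrapper, func), func)

        return decorator

    @deco
    def f(x):
        return x + 1

    @deco(tag="custom")
    def g(x):
        return x * 2

    return f(1), g(2)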