Coverage for src/pycse/hashcache_v1.py: 0.00%

63 statements  

coverage.py v7.11.0, created at 2025-10-23 16:23 -0400

"""hashcache - a decorator for a persistent, file/hash-based cache

I found some features of joblib were unsuitable for how I want to use a cache.

1. joblib saves results under the "file" Python thinks the function is in. This
leads to repeated runs if you run the same code in Python, a notebook or stdin,
and it means the cache is not portable to other machines, and maybe not even in
time, since temp directories and kernel parameters are involved. I could not
figure out how to change those in joblib.

2. joblib uses the function source code in the hash, so inconsequential changes
like whitespace, docstrings and comments change the hash.

This library aims to provide a simpler version of what I wish joblib did for me.

Results are cached based on a hash of the function name, argnames, bytecode, arg
values and kwarg values. I use joblib.hash for this. This means any two
functions with the same name and bytecode, called with the same arguments, will
cache to the same result, even if they are defined in different files.

The cache location is set as a function attribute:

    hashcache.cache = './cache'


This is alpha, proof-of-concept code. Test it a lot for your use case. The API
is not stable, and is subject to change.

Some things to do:

1. The function attributes are kind of weird; maybe these should be decorator
arguments.

Pros:

1. File-based cache, which means many functions can run in parallel reading and
writing, and you are limited only by file I/O speed and disk space.

2. Semi-portability. The cache could be synced across machines, and caches
can be merged with little risk of conflict.

3. No server is required. Everything is done at the OS level.

4. Extensibility. You can define your own functions for loading and dumping
data.

Cons:

1. Hashes are fragile and not robust. They are fragile with respect to any
changes in how bytecode is made, or via mutable arguments, etc. The hashes are
not robust to system-level changes like library versions, or global variables.
The only advantage of hashes is that you can compute them.

2. File-based cache, which means that if you generate thousands of files, it can
be slow to delete them. Although it should be fast to access the results since
you access them directly by path, it will not be fast to iterate over all the
results, e.g. if you want to implement some kind of search or reporting.

3. No server. You have to roll your own update strategy if you run things on
multiple machines that should all cache to a common location.

Changelog
---------

[2023-09-23 Sat] Changed hash signature (breaking change). It is too difficult
to figure out how to capture global state, and the use of internal variable
names is not consistent with using the bytecode to be insensitive to
unimportant variable name changes.

Pulled out some functions for loading and dumping data. This is a precursor to
enabling other backends like lmdb or sqlite instead of files. You can then
simply provide new functions for this.

"""

import functools
import inspect
import joblib
import os
from pathlib import Path
import pprint
import time


def get_standardized_args(func, args, kwargs):
    """Return a standardized dictionary of kwargs for func(*args, **kwargs).

    This dictionary includes default values, even if they were not passed in
    the call.

    """
    sig = inspect.signature(func)
    standardized_args = sig.bind(*args, **kwargs)
    standardized_args.apply_defaults()
    return standardized_args.arguments


def get_hash(func, args, kwargs):
    """Get a hash for running FUNC(ARGS, KWARGS).

    This is the most critical feature of hashcache as it provides a key to store
    and look up results later. Think carefully before changing this function:
    any change invalidates past caches.

    FUNC should be as pure as reasonable. This hash is insensitive to global
    variables.

    The hash covers the function name, the bytecode, and the standardized
    kwargs, including defaults. We use bytecode because it is insensitive to
    things like whitespace, comments, docstrings, and variable name changes that
    don't affect results. It is assumed that two functions with the same name
    and bytecode will evaluate to the same result.

    """
    return joblib.hash(
        [
            func.__code__.co_name,  # This is the function name
            func.__code__.co_code,  # this is the function bytecode
            get_standardized_args(func, args, kwargs),  # The args used, including defaults
        ],
        hash_name="sha1",
    )


def get_hashpath(hsh):
    """Return path to file for HSH."""
    cache = Path(hashcache.cache)
    hshdir = cache / hsh[0:2]
    hshpath = hshdir / hsh
    return hshpath


def load_data(hsh, verbose=False):
    """Load data for HSH.

    HSH is a string for the hash associated with the data you want.

    Returns success, data. If it succeeds, success will be True. If the data
    does not exist yet, success will be False, and data will be None.

    """
    hshpath = get_hashpath(hsh)
    if os.path.exists(hshpath):
        data = joblib.load(hshpath)
        if verbose:
            pp = pprint.PrettyPrinter(indent=4)
            pp.pprint(data)
        return True, data["output"]
    else:
        return False, None


def dump_data(hsh, data, verbose):
    """Dump DATA into HSH."""
    hshpath = get_hashpath(hsh)
    os.makedirs(hshpath.parent, exist_ok=True)

    files = joblib.dump(data, hshpath)

    if verbose:
        pp = pprint.PrettyPrinter(indent=4)
        print(f"wrote {hshpath}")
        pp.pprint(data)

    return files


def hashcache(fn=None, *, verbose=False, loader=load_data, dumper=dump_data):
    """Cache results by hash of the function, arguments and kwargs.

    Set hashcache.cache to the directory you want the cache saved in.
    Default = cache
    """

    def wrapper(func, *args, **kwargs):
        hsh = get_hash(func, args, kwargs)

        # Try getting the data first
        success, data = loader(hsh, verbose)

        if success:
            return data

        # we did not succeed, so we run the function, and cache it
        # We store some metadata for future analysis.
        t0 = time.time()
        value = func(*args, **kwargs)
        tf = time.time()

        # functions with mutable arguments can change the arguments, which
        # is a problem here. We just warn the user. Nothing else makes
        # sense, the mutability may be intentional.
        if hsh != get_hash(func, args, kwargs):
            print("WARNING something mutated, future calls will not use the cache.")

        # Try to get a username, falling back to the USER environment variable.
        try:
            user = os.getlogin()
        except OSError:
            user = os.environ.get("USER")

        data = {
            "output": value,
            "hash": hsh,
            "func": func.__code__.co_name,  # This is the function name
            "module": func.__module__,
            "args": args,
            "kwargs": kwargs,
            "standardized-kwargs": get_standardized_args(func, args, kwargs),
            "version": hashcache.version,
            "cwd": os.getcwd(),  # Is this a good idea? Could it leak
            # sensitive information from the path?
            # should we include other info like
            # hostname?
            "user": user,
            "run-at": t0,
            "run-at-human": time.asctime(time.localtime(t0)),
            "elapsed_time": tf - t0,
        }

        dumper(hsh, data, verbose)
        return value

    # This silliness is because I want to have the decorator work with and
    # without arguments
    #
    # @hashcache
    # def f(...)
    #
    # and
    # @hashcache(verbose=True)
    # def f(...)
    #
    # yea, it feels gross.
    if fn is not None:
        # Copy the name and docstring onto the wrapper here too, so both
        # decorator forms behave the same way.
        return functools.update_wrapper(functools.partial(wrapper, fn), fn)
    else:

        def decorator(func):
            new_wrapper = functools.partial(wrapper, func)
            return functools.update_wrapper(new_wrapper, func)

        return decorator


hashcache.cache = "cache"
hashcache.version = "0.0.3"
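
The cache key above is built from the function name, the bytecode, and the standardized arguments. The following is a minimal, standard-library-only sketch of why those ingredients are stable under cosmetic changes. It does not call joblib, and the names `standardized_args`, `f`, and `g` are hypothetical, introduced only for illustration.

```python
import inspect


def standardized_args(func, args, kwargs):
    """Bind args/kwargs to func's signature and fill in the defaults."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)


def f(a, b=2):
    return a + b


def g(a, b=2):
    """Same body as f, plus a docstring and a comment."""
    # This comment is not compiled into the bytecode.
    return a + b


# Positional and keyword call styles normalize to the same dictionary,
# so they would contribute the same argument component to a cache key.
positional = standardized_args(f, (1,), {})
keyword = standardized_args(f, (), {"b": 2, "a": 1})

# Docstrings and comments do not change the raw bytecode.
same_bytecode = f.__code__.co_code == g.__code__.co_code
```

Note that `f` and `g` would still get different keys in `get_hash`, because the function name is part of the hash.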