* To run a quick set of tests:

python setup.py build_ext -i
ZS_QUICK_TEST=1 nosetests --all-modules zs

* To run a complete set of tests with HTML coverage output:

nosetests --all-modules --with-cov --cov-conf .coveragerc --cov-report term-missing --cov-report html zs

* windows compilation
https://github.com/cython/cython/wiki/64BitCythonExtensionsOnWindows
https://matthew-brett.github.io/pydagogue/python_msvc.html

* Next

** windows wheels? (and also for backports.lzma?)

** once RTD is straightened out, make backports.lzma a proper requirement

** figure out how to get cython coverage (CYTHON_TRACE?)

** check whether the 50 MIB/s/worker we get from both deflate and lzma is b/c of being bottlenecked on unpack_data_records

* LZMA
** proper tuning

using the 3gram "sk" file (not dump --prefix="sk", but the actual file), generated files with varying codec and blocksize options, and measured dump time

three runs on truffle, so -j8 and local disks

| alg  |   bs | size (MiB) | run1 (s) |   run2 |   run3 |   mean |     sd |
|------+------+------------+----------+--------+--------+--------+--------|
| bz2  |   32 |        473 |   76.890 | 57.132 | 58.520 | 64.181 | 11.028 |
|      |   64 |        463 |   42.122 | 38.923 | 38.457 | 39.834 |  1.995 |
|      |  128 |        466 |   29.905 | 29.740 | 29.715 | 29.787 |  0.103 |
|      |  256 |        475 |   33.126 | 32.906 | 29.835 | 31.956 |  1.840 |
|      |  384 |        484 |   33.657 | 34.556 | 36.592 | 34.935 |  1.504 |
|      |  512 |        490 |   34.477 | 34.826 | 33.886 | 34.396 |  0.475 |
|      |  768 |        500 |   36.331 | 37.478 | 35.336 | 36.382 |  1.072 |
|------+------+------------+----------+--------+--------+--------+--------|
| lzma |  128 |        411 |   25.245 | 28.328 | 27.970 | 27.181 |  1.686 |
| 0e   |  256 |        397 |   23.853 | 20.536 | 22.477 | 22.289 |  1.667 |
|      |  384 |        391 |   23.147 | 22.018 | 19.267 | 21.477 |  1.996 |
|      |  512 |        388 |   19.505 | 21.185 | 21.543 | 20.744 |  1.088 |
|      |  768 |        385 |   21.649 | 18.444 | 18.058 | 19.384 |  1.971 |
|------+------+------------+----------+--------+--------+--------+--------|
| lzma |  128 |            |          |        |        |  0.000 |  0.000 |
| 1e   |  256 |        397 |          |        |        |  0.000 |  0.000 |
|      |  384 |        391 |          |        |        |  0.000 |  0.000 |
|      |  512 |        388 |          |        |        |  0.000 |  0.000 |
|      |  768 |        385 |          |        |        |  0.000 |  0.000 |
|      | 1024 |        382 |          |        |        |  0.000 |  0.000 |
#+TBLFM: $7=vmean($4..$6);%.3f::$8=vsdev($4..$6);%.3f

Observations:
- for bz2
  - 128k is obviously the speed sweet spot -- larger block sizes make things slower (b/c the BWT complexity grows with the size of the data, I guess)
  - weirdly, it is also approximately the sweet spot on file size -- larger block sizes also produce larger files (very weird! I have no explanation for this)
- for lzma
  - going from 128 -> 256 makes things much faster (20%?!), with smaller but possibly reliable gains after that
  - larger block sizes make files smaller: big jump 128->256, smaller from 256->384, and even smaller after that. (256 is the lzma context size at 0e, so this also makes some sense)
  - 1e has basically no size advantage -- at 768k, the actual sizes are 402655885 (1e) versus 403639270 (0e) -- so a little smaller, but less than 0.5%. the gap does grow with increasing block size, but it's still negligble.

For 1e we might expect more of a win for larger blocksizes, at least on file size

for BS in 384 512 768; do for CODEC in lzma bz2; do gunzip -c google-books-eng-us-all-20120701/3gram/sorted-googlebooks-eng-us-all-3gram-20120701-sk.gz | zs make '{}' - 3gram-sk-${BS}k-${CODEC}.zs --approx-block-size=$(($BS * 1024)) --codec $CODEC; done; done

for codec in ["bz2"]:
    for s in [32, 128, 64]:
        times = []
        for i in xrange(3):
            z = ZS("3gram-sk-%sk-%s.zs" % (s, codec))
            start = time.time()
            z.dump(open("/dev/null", "wb"))
            times.append(time.time() - start)
            z.close()
        print("| %s | %s |   | %0.3f | %0.3f | %0.3f |  |  |"
              % (codec, s, times[0], times[1], times[2]))

** general benchmarks
on the 'old/sorted' test vector, using branna, and python3

| alg      |    size | compress time | decompress time |
|----------+---------+---------------+-----------------|
| bz2 -1   | 1036222 | 774 ms        | 153 ms          |
| bz2 -9   | 1140258 | 799 ms        | 208 ms          |
| lzma -0  | 1122836 | 340 ms        | 101 ms          |
| lzma -1  | 1109110 | 396 ms        | 95.7 ms         |
| lzma -0e |  810799 | 4520 ms       | 104 ms          |
| lzma -1e |  805229 | 5900 ms       | 103 ms          |
| deflate  | 1441904 | 323 ms        | 20.6 ms         |

using old/smalltest:

the dump times here are highly variable...

| alg     | blocksize |     size | make walltime | dump walltime |
|---------+-----------+----------+---------------+---------------|
| lzma 0e | 128 KiB   | 27663364 | 42 s          | 2.387 s       |
| lzma 0e | 1024 KiB  | 27197990 | 50 s          | 1.993 s       |
| lzma 0e | 2048 KiB  | 27173018 | 52 s          |               |

for single zpayloads on my laptop, lzma.decompress is 16 ms for the 1M uncompressed block size, and 2.5 ms for the 128K uncompressed block size.

maybe 512K -> 10 ms -> about the same as a disk seek?

* Better parallelism

** Solving the control-C problem (and friends) by writing a better process pool
reader.py's process pool is terrible, esp. on py2

- if you are using the command-line tools, control-C tends to just blow up
- if you are in ipython, then control-C destroys your ZS object

a better pool would be:
- spawn our own processes with custom loops, a la writer.py
- for these children, set SIGINT to SIG_IGN, so we just don't have to deal with it.
- maybe have them poll the parent process occasionally to make sure that it's still alive. having the parent process unconditionally kill them on exit gets us a good part of the way, but can we do even better? heartbeats? keep a pipe around?
  - some options of varying cleverness and portability: https://stackoverflow.com/questions/284325/how-to-make-child-process-die-after-parent-exits
  - most reliable (but not quite mentioned there) seems to be to pass the parent's PID into the child, and then have the child poll on getppid().
    easiest way to do this would be for each worker to spawn a thread that just does this and calls os.exit if the parent goes.
  - windows approach: https://stackoverflow.com/a/12942797/1925449
    http://msdn.microsoft.com/en-us/library/ms684161%28VS.85%29.aspx
    requires win32api and some way to get a process handle for multiprocessing child
- and if a worker just dies (which should never happen), then close the parent, because who knows what got lost and this should be a very unusual event. but we do need to watch for it when blocking on get() in the main thread.

** A wackier alternative
if we spend a lot of time in IPC for a streaming reads, it's possible we should switch to threads instead of processes.

this wouldn't be too hard -- the trick would be to write a tiny circular queue in Cython, start threads using the standard Python APIs, and have all the threads execute a Cython loop that drops the GIL and then pulls char*'s off the queue and puts back decompressed char*'s.

we might even get away with using the GIL to serialize access to the queue, and just using Cython to get a GIL-dropping interface to lzma_decompress etc. Actually I guess the lzma module already has per-compressor locking, it's just zlib and bz2 that are unhelpful this way.
