Metadata-Version: 2.4
Name: cuckooget
Version: 0.1.0
Requires-Dist: aiohttp
Requires-Dist: aiofiles
Requires-Dist: beautifulsoup4
Requires-Dist: xxhash
Requires-Dist: ujson
License-File: LICENSE
Summary: A very fast website mirror script.
Author-email: taro <taro@eyes4you.org>
License-Expression: BSD-3-Clause
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Source, https://github.com/haturatu/cuckooget

# cuckooget
```
                      __                                      __      
                     /\ \                                    /\ \__   
  ___   __  __    ___\ \ \/'\     ___     ___      __      __\ \ ,_\  
 /'___\/\ \/\ \  /'___\ \ , <    / __`\  / __`\  /'_ `\  /'__`\ \ \/  
/\ \__/\ \ \_\ \/\ \__/\ \ \\`\ /\ \L\ \/\ \L\ \/\ \L\ \/\  __/\ \ \_ 
\ \____\\ \____/\ \____\\ \_\ \_\ \____/\ \____/\ \____ \ \____\\ \__\
 \/____/ \/___/  \/____/ \/_/\/_/\/___/  \/___/  \/___L\ \/____/ \/__/
                                                   /\____/            
                                                   \_/__/             
```
## What
A very fast website mirroring script built on a cuckoo hash table, xxhash, and a DAG. It still has many known problems.
I feel sad about disappearing websites, so I am looking for ways to save them even faster.  
  
*Websites are our memories.*  
Let everyone rise up and preserve disappearing historical websites, leaving them for the future.  
For all geeks and for those who love the internet. If you find an interesting website, please contact me.  
  
Furthermore, the `-w` option lets you raise the priority of URLs based on strings you specify. I don't think other website mirroring software has this feature.
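
To illustrate the idea behind `-w`, here is a minimal sketch of a weighted crawl queue. It is not cuckooget's actual implementation: the names `url_priority` and `PriorityFrontier` are hypothetical, and the scoring rule (one point per weight string found in the URL) is an assumption made for this example.

```python
import heapq
from itertools import count
from typing import Iterable, List, Tuple


def url_priority(url: str, weights: Iterable[str]) -> int:
    """Return a sort key: the more weight strings a URL contains, the lower
    (that is, the earlier) its key. The scoring rule is an assumption."""
    return -sum(1 for w in weights if w in url)


class PriorityFrontier:
    """Hypothetical crawl queue that pops higher-weighted URLs first."""

    def __init__(self, weights: Iterable[str]) -> None:
        self.weights: List[str] = list(weights)
        self._heap: List[Tuple[int, int, str]] = []
        # Tie-breaker so URLs with equal priority keep insertion order.
        self._order = count()

    def push(self, url: str) -> None:
        key = url_priority(url, self.weights)
        heapq.heappush(self._heap, (key, next(self._order), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = PriorityFrontier(["blog", "2004"])
frontier.push("https://example.com/about")
frontier.push("https://example.com/blog/")
frontier.push("https://example.com/blog/2004/post")
first = frontier.pop()  # the URL matching both weight strings
```

In this sketch, a URL that matches more of the given strings is fetched before one that matches fewer, and unmatched URLs keep their discovery order.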
  
Hash values are generated by the ultra-fast xxhash, and collisions are avoided by the cuckoo hash table.
The table uses xxh32 and xxh64 to produce two different hash values for each entry.    
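
The following is a minimal sketch of how a cuckoo hash set for visited URLs could be built on two hash functions. It is an illustration under assumptions, not cuckooget's actual code: the `CuckooSet` class, its parameters, and the grow-on-failure policy are hypothetical, and the `hashlib` fallback exists only so the sketch runs where `xxhash` is not installed.

```python
import hashlib

try:
    import xxhash

    def h1(key: str) -> int:
        return xxhash.xxh32(key.encode("utf-8")).intdigest()

    def h2(key: str) -> int:
        return xxhash.xxh64(key.encode("utf-8")).intdigest()

except ImportError:
    # Fallback (assumption for portability): two independent BLAKE2b digests.
    def h1(key: str) -> int:
        d = hashlib.blake2b(key.encode("utf-8"), digest_size=4, salt=b"h1")
        return int.from_bytes(d.digest(), "big")

    def h2(key: str) -> int:
        d = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=b"h2")
        return int.from_bytes(d.digest(), "big")


class CuckooSet:
    """Hypothetical two-table cuckoo hash set of strings."""

    def __init__(self, capacity: int = 16, max_kicks: int = 32) -> None:
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]
        self._size = 0

    def _slots(self, key: str):
        # Each key has exactly one candidate slot per table.
        return (h1(key) % self.capacity, h2(key) % self.capacity)

    def __contains__(self, key: str) -> bool:
        i, j = self._slots(key)
        return self.tables[0][i] == key or self.tables[1][j] == key

    def __len__(self) -> int:
        return self._size

    def add(self, key: str) -> bool:
        """Insert key; return True if it was new, False if already present."""
        if key in self:
            return False
        self._place(key)
        self._size += 1
        return True

    def _place(self, key: str) -> None:
        current = key
        for kick in range(self.max_kicks):
            t = kick % 2  # alternate between the two tables
            idx = self._slots(current)[t]
            # Put current in its slot and pick up whatever was there.
            current, self.tables[t][idx] = self.tables[t][idx], current
            if current is None:
                return
        # Too many evictions: enlarge the tables, then retry the leftover key.
        self._grow()
        self._place(current)

    def _grow(self) -> None:
        old = [k for table in self.tables for k in table if k is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k in old:
            self._place(k)


seen = CuckooSet()
seen.add("https://example.com/")  # first visit: returns True
seen.add("https://example.com/")  # duplicate: returns False
```

Because every key can live in only one of two slots, a membership check reads at most two positions, which is what makes this structure attractive for deduplicating URLs during a crawl.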

DeepWiki: [https://deepwiki.com/haturatu/cuckooget](https://deepwiki.com/haturatu/cuckooget)
  
## Install

### Experimental: `curl-impersonate` support
This experimental version uses `curl-impersonate` to avoid being blocked by websites. It mimics the TLS/JA3 fingerprint of a real browser.

To use this feature, check out the `feat/curl-impersonate` branch:

```bash
git checkout feat/curl-impersonate
make && make install
```

Dependencies:
```bash
curl https://pyenv.run | bash

pyenv install 3.12.3
python -m pip install maturin
```
### GNU Make
I recommend installing with GNU Make.  
```bash
make
make install
```

For an editable install:

```bash
make develop
```

### Bash
```bash
chmod +x install.sh
./install.sh
```

## Usage
```
$ ck -h
usage: ck [-h] [-c CONNECTIONS] [-w WEIGHTS [WEIGHTS ...]] [-v EXCLUDE [EXCLUDE ...]]
          [-f]
          url output_dir

Mirrors a website.

positional arguments:
  url                   URL of the website to mirror
  output_dir            Directory to save the mirrored files

options:
  -h, --help            show this help message and exit
  -c CONNECTIONS, --connections CONNECTIONS
                        Number of simultaneous connections (default: 50)
  -w WEIGHTS [WEIGHTS ...], --weights WEIGHTS [WEIGHTS ...]
                        Strings to set URL priorities (can specify multiple separated
                        by spaces)
  -v EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
                        URL patterns to exclude (can specify multiple separated by
                        spaces)
  -f, --force           Force re-download even if the download was already completed
```

