
Conversation

@svvimming svvimming commented May 6, 2023

Description

Two main features have been added to the CID importer cron:

  1. Backups of the entire CID files retrieved from Web3.Storage to the Open Panda Backblaze bucket

  2. Multithreaded processing:

  • CID retrieval from Web3.Storage, zst unpacking, metadata extraction, and import to the database are now handled in a worker thread (cid-batch-import.js in the crons directory). The first part of the main cid-importer.js script is unchanged: a manifest list of CIDs to download is still generated and stored at tmp/cid-files/cid-manifest.txt. However, where batches from the manifest were previously retrieved and processed in series, the script now delegates batches to worker threads that process them in parallel.

  Two new arguments can be passed to the cid-importer.js script:

  • --threads — the integer number of workers to spawn

  • --all — a boolean which, if true, skips the search for the last imported document in the database and retrieves all CIDs starting from the oldest existing upload

  The previous two arguments both still apply:

  • --pagesize — an integer specifying the import/backup batch size

  • --maxpages — an integer specifying how many batches to process; if left unspecified, no limit is placed on the number of batches
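For reviewers, here is a minimal sketch of the batching and worker-assignment logic described above. It is illustrative only: the flag names match the ones introduced in this PR, but the function names and defaults are hypothetical and do not mirror the actual implementation in cid-importer.js.

```javascript
// Hypothetical sketch of manifest batching for the multithreaded importer.
// Flag names (--threads, --all, --pagesize, --maxpages) are from this PR;
// everything else (function names, defaults) is assumed for illustration.

// Parse the CLI flags; default values here are assumptions.
function parseArgs(argv) {
  const args = { threads: 1, all: false, pagesize: 100, maxpages: Infinity };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case '--threads': args.threads = parseInt(argv[++i], 10); break;
      case '--all': args.all = argv[++i] === 'true'; break;
      case '--pagesize': args.pagesize = parseInt(argv[++i], 10); break;
      case '--maxpages': args.maxpages = parseInt(argv[++i], 10); break;
    }
  }
  return args;
}

// Split the manifest's CID list into batches of `pagesize`,
// stopping after `maxpages` batches.
function makeBatches(cids, pagesize, maxpages) {
  const batches = [];
  for (let i = 0; i < cids.length && batches.length < maxpages; i += pagesize) {
    batches.push(cids.slice(i, i + pagesize));
  }
  return batches;
}

// Round-robin the batches across `threads` groups; each group would then be
// handed to a worker thread running cid-batch-import.js.
function assignToWorkers(batches, threads) {
  const groups = Array.from({ length: threads }, () => []);
  batches.forEach((batch, i) => groups[i % threads].push(batch));
  return groups;
}
```

Under this sketch, a run like `node cid-importer.js --threads 4 --pagesize 100` would produce batches of 100 CIDs distributed round-robin over 4 worker groups.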

Ticket link

https://www.notion.so/agencyundone/Backup-all-dataset-manifests-to-Backblaze-3196a93f141546a3a91602d78b3dbd7f?pvs=4

@svvimming svvimming requested a review from justanothersynth May 6, 2023 01:46
@svvimming svvimming self-assigned this May 6, 2023
@svvimming
Contributor Author

closing in favor of #68

@svvimming svvimming closed this Jun 2, 2023
