parselmouth is a utility that maps Conda package names to their corresponding PyPI names and vice versa. It regenerates and updates these mappings hourly, so users always have access to accurate, up-to-date information.
Test the complete pipeline locally with MinIO (S3-compatible storage):

```bash
# One-command start (recommended) - starts MinIO + interactive mode
pixi run test-interactive

# Or run manually with more control:

# 1. Start MinIO
docker-compose up -d

# 2. Run with defaults (pytorch, noarch, package names starting with 't')
pixi run test-pipeline

# 3. Test with conda-forge, package names starting with 'n' (numpy, napari, etc.)
pixi run test-pipeline --channel conda-forge --letter n

# 4. Test incrementally (skip packages already in MinIO)
pixi run test-pipeline --mode incremental

# 5. Test with all packages in a subdir (warning: can be slow!)
pixi run test-pipeline --channel bioconda --letter all

# Multiple channels can coexist in the same bucket (separated by path prefixes)

# Access the MinIO UI at http://localhost:9001 (minioadmin / minioadmin)

# Clean up when done
pixi run clean-all        # Everything + stop MinIO
pixi run clean-local-data # Just cache + outputs (keep MinIO)
```

See docs/LOCAL_TESTING.md for detailed information.
Example of the mapping for `numpy-1.26.4-py311h64a7726_0.conda` with sha256 `3f4365e11b28e244c95ba8579942b0802761ba7bb31c026f50d1a9ea9c728149`:
```json
{
  "pypi_normalized_names": ["numpy"],
  "versions": {
    "numpy": "1.26.4"
  },
  "conda_name": "numpy",
  "package_name": "numpy-1.26.4-py311h64a7726_0.conda",
  "direct_url": [
    "https://github.com/numpy/numpy/releases/download/v1.26.4/numpy-1.26.4.tar.gz"
  ]
}
```

A more simplified version of our mapping is stored in `files/mapping_as_grayskull.json`.
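An entry like this can also be fetched programmatically from the public mapping endpoint described later in this README. A minimal sketch using the `requests` library:

```python
import requests

# SHA256 of numpy-1.26.4-py311h64a7726_0.conda, as shown above
sha256 = "3f4365e11b28e244c95ba8579942b0802761ba7bb31c026f50d1a9ea9c728149"

resp = requests.get(f"https://conda-mapping.prefix.dev/hash-v0/{sha256}")
resp.raise_for_status()
mapping = resp.json()

print(mapping["pypi_normalized_names"])  # ["numpy"]
print(mapping["versions"])               # {"numpy": "1.26.4"}
```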
Example of the mapping of `requests` to its corresponding conda versions. This shows the known conda names per PyPI version; if a version is missing, it is not available on that conda channel:
```json
{
  "2.10.0": ["requests"],
  "2.11.0": ["requests"],
  "2.11.1": ["requests"],
  "2.12.0": ["requests"],
  "2.12.1": ["requests"],
  "2.12.4": ["requests"],
  "2.12.5": ["requests"],
  "2.13.0": ["requests"],
  "2.17.3": ["requests"],
  "2.18.1": ["requests"],
  "2.18.2": ["requests"],
  "2.18.3": ["requests"],
  "2.18.4": ["requests"],
  "2.19.0": ["requests"],
  "2.19.1": ["requests"],
  "2.20.0": ["requests"],
  "2.20.1": ["requests"],
  "2.21.0": ["requests"],
  "2.22.0": ["requests"],
  "2.23.0": ["requests"],
  "2.9.2": ["requests"],
  "2.27.1": ["requests", "arm_pyart"],
  "2.24.0": ["requests", "google-cloud-bigquery-storage-core"],
  "2.26.0": ["requests"],
  "2.25.1": ["requests"],
  "2.25.0": ["requests"],
  "2.27.0": ["requests"],
  "2.28.0": ["requests"],
  "2.28.1": ["requests"],
  "2.31.0": ["requests", "jupyter-sphinx"],
  "2.28.2": ["requests"],
  "2.29.0": ["requests"],
  "2.32.1": ["requests"],
  "2.32.2": ["requests"],
  "2.32.3": ["requests"]
}
```

Two mappings are currently online, one of which is a work in progress (#2). Both are available behind the following URL:
https://conda-mapping.prefix.dev/:

- The Conda → PyPI name mapping, which maps a conda package name and version to its known PyPI counterpart. This is available at `https://conda-mapping.prefix.dev/hash-v0/{sha256}`, where `{sha256}` is the sha256 of the conda package, taken from a package record in the channel's `repodata.json` file. So, for example, to find the PyPI name of `numpy-1.26.4-py310h4bfa8fc_0.conda` you can use the following URI: https://conda-mapping.prefix.dev/hash-v0/914476e2d3273fdf9c0419a7bdcb7b31a5ec25949e4afbc847297ff3a50c62c8
- (WIP) The PyPI → Conda name mapping, which maps a PyPI package to its known Conda counterpart. This only works for packages that are available on the conda channels it references. This is available at `https://conda-mapping.prefix.dev/pypi-to-conda-v1/{channel}/{pypi-normalized-name}.json`, where `{channel}` is the name of the channel and `{pypi-normalized-name}` is the normalized name of the package on PyPI. E.g. for `requests` we can use https://conda-mapping.prefix.dev/pypi-to-conda-v1/conda-forge/requests.json, which will give you the corresponding JSON.
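To illustrate the second endpoint, here is a minimal sketch of a PyPI → Conda lookup. Since that endpoint is still a work in progress, the response shape may evolve:

```python
import requests

channel, pypi_name = "conda-forge", "requests"
url = f"https://conda-mapping.prefix.dev/pypi-to-conda-v1/{channel}/{pypi_name}.json"

resp = requests.get(url)
resp.raise_for_status()
# The returned JSON lists the known conda names per PyPI version,
# like the requests example shown earlier in this README
print(resp.json())
```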
Parselmouth uses two primary storage locations:
The main package mapping data is stored in Cloudflare R2 (S3-compatible storage), configured via the `R2_PREFIX_BUCKET` environment variable. The bucket contains:

Hash-based Mappings (v0):

- `hash-v0/{channel}/index.json` - Channel-specific index containing all package hashes
- `hash-v0/{package_sha256}` - Individual mapping entries keyed by conda package SHA256 hash

Relations Tables (v1):

- `relations-v1/{channel}/relations.jsonl.gz` - Master relations table (JSONL format, gzipped)
- `relations-v1/{channel}/metadata.json` - Metadata about the relations table
- `pypi-to-conda-v1/{channel}/{pypi_name}.json` - Fast PyPI lookup files derived from the relations table
The files/ directory in the repository stores compressed mappings that are committed to version control:
- `files/mapping_as_grayskull.json` - Legacy mapping format for Grayskull compatibility
- `files/compressed_mapping.json` - Compressed mapping (legacy format)
- `files/v0/{channel}/compressed_mapping.json` - Channel-specific compressed mappings (conda-forge, pytorch, bioconda)
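For anyone inspecting the bucket directly, here is a minimal sketch that lists a few objects under the documented prefixes. It assumes standard S3 credentials plus an endpoint URL for R2 (R2 is S3-compatible); the `R2_ENDPOINT_URL` variable name is hypothetical, only `R2_PREFIX_BUCKET` is documented above:

```python
import os
import boto3

# R2_ENDPOINT_URL is a hypothetical variable name for illustration;
# only R2_PREFIX_BUCKET is documented by this project.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["R2_ENDPOINT_URL"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
bucket = os.environ["R2_PREFIX_BUCKET"]

# Peek at a handful of keys under each documented prefix
for prefix in ("hash-v0/conda-forge/", "relations-v1/conda-forge/"):
    page = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=5)
    for obj in page.get("Contents", []):
        print(obj["Key"])
```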
Parselmouth uses a versioned approach to support multiple data formats:
v0 (Current Hash-based System):

- Uses conda package SHA256 hashes as keys
- Direct lookup: `hash-v0/{sha256}` returns a single mapping entry
- Optimized for conda → PyPI lookups
- Both old and new workflows write to this path
v1 (Relations System - New):

- Stores package relationships in a normalized table format
- Enables PyPI → conda lookups and dependency analysis
- Three-tier structure:
  - Master relations table (source of truth)
  - Metadata (statistics, generation timestamp)
  - Derived lookup files (cached for performance)
- Only new workflows with the `update_relations_table` job write to this path
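The documented paths boil down to simple key schemes. A small sketch of helpers that build them:

```python
def conda_to_pypi_key(sha256: str) -> str:
    """v0: direct lookup of a single mapping entry by package hash."""
    return f"hash-v0/{sha256}"

def pypi_to_conda_key(channel: str, pypi_name: str) -> str:
    """v1: derived lookup file for PyPI -> conda queries."""
    return f"pypi-to-conda-v1/{channel}/{pypi_name}.json"

def relations_table_key(channel: str) -> str:
    """v1: master relations table (source of truth)."""
    return f"relations-v1/{channel}/relations.jsonl.gz"
```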
The GitHub Actions workflows are organized into stages:

- Producer Stage (`generate_hash_letters`):
  - Identifies missing packages by comparing upstream channel repodata with the existing index
  - Outputs a matrix of `subdir@letter` combinations to process in parallel (see the sketch after this list)
- Updater Stage (`updater_of_records`):
  - Runs in parallel for each `subdir@letter` combination
  - Downloads artifact metadata and extracts PyPI mappings
  - Uploads individual package mappings to `hash-v0/{sha256}`
- Merger Stage (`updater_of_index`):
  - Combines all partial indices into a master index
  - Uploads the consolidated index to `hash-v0/{channel}/index.json`
- Relations Generation Stage (`update_relations_table`) - NEW:
  - Runs after the merger stage completes
  - Reads the updated index and generates the relations table
  - Uploads to `relations-v1/{channel}/` paths
  - Only present in new workflows with relations support
- Commit Stage (`update_file`):
  - Updates local git repository files
  - Runs mapping transformations (`update-mapping-legacy`, `update-mapping`)
  - Commits compressed mappings to version control
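To make the producer stage concrete, here is an illustrative sketch of how missing work could be derived by comparing repodata against the existing index. All names here are hypothetical, not the repository's actual implementation:

```python
# Illustrative sketch of the producer stage's core idea: compare upstream
# repodata against the already-indexed hashes and emit subdir@letter work items.
def build_matrix(
    repodata_by_subdir: dict[str, dict[str, dict]],
    indexed_hashes: set[str],
) -> list[str]:
    combos: set[str] = set()
    for subdir, packages in repodata_by_subdir.items():
        for filename, record in packages.items():
            if record["sha256"] not in indexed_hashes:
                # first letter of the package filename groups the work
                combos.add(f"{subdir}@{filename[0]}")
    return sorted(combos)
```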
The new workflows with relations support do NOT overwrite or interfere with old data:
- Same bucket, different prefixes: Both old and new workflows use `R2_PREFIX_BUCKET`, but write to isolated path prefixes
- v0 paths: Both systems continue to write hash-based mappings (backward compatible)
- v1 paths: Only new workflows write relations data (additive, no conflicts)
- No destructive operations: New workflows add functionality without removing or replacing existing data
This architecture allows for:
- Zero-downtime deployment of relations features
- Gradual migration from v0 to v1 APIs
- Rollback capability if issues arise
- Parallel operation of both systems during transition
The RelationsTable is a normalized table that maps Conda packages to PyPI packages and vice versa. Think of it as a many-to-many relationship database.
The table stores pairs of related packages:

- Conda side: package name + version + build (e.g., `numpy-1.26.4-py311h64a7726_0`)
- PyPI side: package name + version (e.g., `numpy==1.26.4`)

Each row in the table represents one relationship between a specific conda build and a PyPI package version.
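As a mental model, each row can be thought of as a small record type. A minimal sketch, with field names following the JSONL format shown later in this README:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """One row of the relations table: a conda build <-> PyPI version pair."""
    conda_name: str
    conda_version: str
    conda_build: str
    pypi_name: str
    pypi_version: str
    channel: str
```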
Conda → PyPI (hash-based):

- Given a conda package hash, find which PyPI packages it contains
- Location: `hash-v0/{sha256}`
- Example: `numpy-1.26.4-py311h64a7726_0` → `numpy==1.26.4`

PyPI → Conda (aggregated files):

- Given a PyPI package name, find all available conda versions
- Location: `pypi-to-conda-v1/{channel}/{pypi_name}.json`
- Example: `requests` → all conda packages containing requests
```json
{
  "pypi_name": "requests",
  "conda_versions": {
    "2.31.0": ["requests", "jupyter-sphinx"],
    "2.32.3": ["requests"]
  }
}
```

- Many-to-One (common): Multiple conda builds for one PyPI version
  - Example: `numpy-1.26.4-py311h...` and `numpy-1.26.4-py310h...` both map to `numpy==1.26.4`
- One-to-Many (vendoring): One conda package contains multiple PyPI packages
  - Example: `arm_pyart` vendors `requests`, creating two mappings from one conda build
- Many-to-Many: A PyPI version appears in multiple conda packages
  - Example: `requests==2.31.0` is in both the `requests` and `jupyter-sphinx` conda packages
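These cardinalities are easy to surface from the table itself. An illustrative sketch that groups rows by conda build to detect vendoring, assuming `Relation` rows as sketched earlier:

```python
from collections import defaultdict

def find_vendoring(relations):
    """Return conda builds that map to more than one PyPI package."""
    by_build = defaultdict(set)
    for r in relations:
        by_build[(r.conda_name, r.conda_version, r.conda_build)].add(r.pypi_name)
    return {build: names for build, names in by_build.items() if len(names) > 1}
```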
The table is stored as JSONL (JSON Lines) with gzip compression:

```jsonl
# Each line is one relation
{"conda_name": "numpy", "conda_version": "1.26.4", "conda_build": "py311h64a7726_0", "pypi_name": "numpy", "pypi_version": "1.26.4", "channel": "conda-forge"}
```

Benefits:
- Each relationship stored exactly once (no duplication)
- Can query in either direction
- Incremental updates are simple
- Compact: ~10-30 MB compressed for conda-forge
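Because the format is plain JSONL under gzip, the table can be streamed with the standard library alone. A minimal sketch, assuming the file has already been downloaded locally:

```python
import gzip
import json

# Stream the relations table (object key: relations-v1/{channel}/relations.jsonl.gz)
with gzip.open("relations.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        relation = json.loads(line)
        if relation["pypi_name"] == "requests":
            print(relation["conda_name"], relation["pypi_version"])
```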
- Total relationships: ~1.5 million
- Unique conda packages: ~1.4 million
- Unique PyPI packages: ~18,000
The ratio (~1.07 relationships per conda package) shows that most conda packages map to a single PyPI package, with occasional vendoring creating the extra relationships.
Developed with ❤️ at prefix.dev.