davidancor/c1v-id

c1v-id

Identity resolution for AI applications

PyPI version Python 3.10+ License: MIT

AI agents that interact with customers, CRMs, or any system of record face a critical decision point: is this person already in our system, or should we create a new record? Because both input and existing data are often messy, agents can confuse customer records, pollute data with duplicates, or deliver poor customer experiences.

c1v-id is an open-source identity resolution library that sits between the agent and the system of record and answers identity queries in milliseconds. It uses probabilistic record linkage with blocking strategies (roughly O(n) comparisons instead of the naive O(n²)), weighted multi-field scoring, and transitive clustering. It is designed as a drop-in component for LangChain agents, n8n workflows, and RAG pipelines, has zero ML dependencies, and supports configurable survivorship rules.

Use Cases

  • AI Agents: Check if a customer exists before creating a new record
  • CRM Deduplication: Merge duplicate contacts from multiple sources
  • Lead Routing: Match incoming leads to existing opportunities
  • Customer Support: Find customer context across fragmented records
  • Data Migration: Deduplicate when merging systems

vs. Enterprise CDPs (Segment, mParticle)

| | c1v-id | Enterprise CDP |
| --- | --- | --- |
| Cost | Free | $100K+/year |
| Data Location | Your infrastructure | Their cloud |
| Customization | Full control | Limited |
| Integration | Any Python app | Vendor lock-in |

Enterprise CDPs solve identity as part of a larger platform. c1v-id gives you just the identity resolution piece to embed anywhere.

Core Concepts

| Concept | What It Does | Why It Matters |
| --- | --- | --- |
| Normalization | Cleans emails, phones, names | [email protected][email protected] |
| Blocking | Groups likely matches | Reduces O(n²) to ~O(n) |
| Scoring | Calculates similarity | Weighted fuzzy matching across fields |
| Clustering | Groups transitive matches | If A≈B and B≈C, then A≈C |
| Golden Records | Merges duplicates | Best value wins per survivorship rules |
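
To make the blocking row concrete, here is a minimal standard-library sketch (not c1v-id's implementation) that counts candidate pairs with and without blocking. The `block_key` helper and the sample records are hypothetical; the key used here is simply the email domain.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records, for illustration only.
records = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "ann.lee@example.com"},
    {"id": 3, "email": "bob@example.org"},
    {"id": 4, "email": "bobby@example.org"},
    {"id": 5, "email": "cat@example.net"},
    {"id": 6, "email": "dan@example.io"},
]

def block_key(record):
    """Assumed blocking key: the lowercased email domain."""
    return record["email"].rsplit("@", 1)[-1].lower()

def candidate_pairs(records):
    """Yield only pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)

# Without blocking, every record is compared with every other record.
naive_pairs = list(combinations([r["id"] for r in records], 2))
blocked_pairs = list(candidate_pairs(records))

print(len(naive_pairs))  # 15
print(blocked_pairs)     # [(1, 2), (3, 4)]
```

Blocking only approaches linear cost when blocks stay small. A key that lumps many records together, such as a very common email domain on its own, brings the quadratic comparison count back inside that block.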

Installation

pip install c1v-id

Quick Start

Resolve duplicates in 10 lines of Python:

from c1v_id import IdentityResolver

resolver = IdentityResolver()

records = [
    {"email": "[email protected]", "name": "John Doe", "phone": "555-1234"},
    {"email": "[email protected]", "name": "J. Doe", "phone": "555-1234"},
    {"email": "[email protected]", "name": "Jane Smith"},
]

golden = resolver.resolve(records)
print(f"Input: {len(records)} records → Output: {len(golden)} golden records")
# Input: 3 records → Output: 2 golden records

Match Two Records

result = resolver.match(
    {"email": "[email protected]", "name": "John"},
    {"email": "[email protected]", "name": "Johnny"}
)

print(result.score)       # 0.97
print(result.decision)    # 'auto_merge'
print(result.matched_on)  # ['email', 'name']

Find Matches in Existing Data

incoming = {"email": "[email protected]", "name": "John"}
existing = [
    {"id": "1", "email": "[email protected]", "name": "John Doe"},
    {"id": "2", "email": "[email protected]", "name": "Jane Doe"},
]

matches = resolver.find_matches(incoming, existing)
# Returns best matches sorted by score

Custom Configuration

from c1v_id import IdentityResolver, ResolverConfig, Thresholds, Weights

config = ResolverConfig(
    thresholds=Thresholds(auto_merge=0.95, needs_review=0.8),
    weights=Weights(email=0.6, phone=0.3, name=0.1, address=0.0),
)

resolver = IdentityResolver(config=config)
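
As a rough illustration of how weights and thresholds could interact, the sketch below computes a weighted average of per-field similarities and maps it to a decision. It uses `difflib` from the standard library as a stand-in metric rather than c1v-id's scorer. The `score` and `decide` helpers, the renormalization over fields present in both records, and the `no_match` label are assumptions, not the library's documented behavior.

```python
from difflib import SequenceMatcher

# Weights and thresholds mirror the configuration above.
WEIGHTS = {"email": 0.6, "phone": 0.3, "name": 0.1, "address": 0.0}
AUTO_MERGE = 0.95
NEEDS_REVIEW = 0.8

def similarity(a, b):
    """Fuzzy similarity in [0, 1] using difflib (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(rec_a, rec_b):
    """Weighted average over fields present in both records (assumed rule)."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        if weight > 0 and rec_a.get(field) and rec_b.get(field):
            total += weight * similarity(rec_a[field], rec_b[field])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

def decide(value):
    """Map a score to a decision label using the two thresholds."""
    if value >= AUTO_MERGE:
        return "auto_merge"
    if value >= NEEDS_REVIEW:
        return "needs_review"
    return "no_match"  # hypothetical label for scores below both thresholds

# Hypothetical records, for illustration only.
a = {"email": "pat@example.com", "phone": "5550100", "name": "Pat Lee"}
b = {"email": "pat@example.com", "phone": "5550100", "name": "Pat Lee"}
c = {"email": "sam@example.org", "phone": "5550199", "name": "Sam Roe"}

print(decide(score(a, b)))  # auto_merge
print(decide(score(a, c)))  # no_match
```

With `address` weighted at 0.0, that field never affects the result, which is one way to switch a field off without removing it from the records.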

Why c1v-id?

vs. Splink

| | c1v-id | Splink |
| --- | --- | --- |
| Hello World | 10 lines | 50+ lines |
| Target | AI builders | Data analysts |
| Setup | pip install | Spark/DuckDB config |
| ML Required | No | Optional |
| Use Case | Real-time matching | Batch analytics |

Splink is powerful for large-scale data linkage projects with dedicated analysts. c1v-id is for developers who need identity resolution as a feature, not a project.

vs. dedupe

| | c1v-id | dedupe |
| --- | --- | --- |
| Maintenance | Active | Stale (2+ years) |
| Dependencies | 3 (pandas, rapidfuzz, pyyaml) | 10+ |
| Learning Curve | Minimal | Requires training data |
| API Style | resolve(records) | Iterative labeling |

dedupe requires interactive labeling to train a model. c1v-id works out of the box with sensible defaults.

Low-Level API

For custom pipelines, use the building blocks directly:

Normalization

from c1v_id import norm_email, norm_phone, norm_name

norm_email("[email protected]")  # '[email protected]'
norm_phone("(555) 123-4567")          # '5551234567'
norm_name("  JOHN   DOE  ")           # 'john doe'
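
For intuition, the sketch below shows what minimal normalizers of this kind might look like using only the standard library. The `simple_norm_*` names are hypothetical and the rules are assumptions; c1v-id's own functions may apply additional rules that are not shown here.

```python
import re

def simple_norm_email(value):
    """Trim surrounding whitespace and lowercase (assumed minimal rule)."""
    return value.strip().lower()

def simple_norm_phone(value):
    """Keep digits only, dropping punctuation and spaces."""
    return re.sub(r"\D", "", value)

def simple_norm_name(value):
    """Lowercase and collapse runs of whitespace into single spaces."""
    return " ".join(value.lower().split())

print(simple_norm_email("  Pat.Lee@Example.COM "))  # pat.lee@example.com
print(simple_norm_phone("(555) 123-4567"))          # 5551234567
print(simple_norm_name("  JOHN   DOE  "))           # john doe
```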

Blocking

from c1v_id import email_domain_last4, phone_last7, make_blocks

email_domain_last4("[email protected]")  # 'gmail.com|john'
phone_last7("555-123-4567")           # '1234567'

# df is assumed to be a pandas DataFrame of normalized records
blocks = make_blocks(df, ["email_domain_last4", "phone_last7"])

Clustering

from c1v_id import UnionFind

uf = UnionFind([1, 2, 3, 4, 5])
uf.union(1, 2)
uf.union(2, 3)
uf.find(1) == uf.find(3)  # True (transitive)
uf.get_clusters()         # {1: [1, 2, 3], 4: [4], 5: [5]}
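
The transitive behavior above comes from the disjoint-set (union-find) data structure. The following is a minimal standalone sketch of that structure, not the library's `UnionFind` class; `SimpleUnionFind` and its tie-breaking rule (the smaller root wins) are assumptions chosen so the output is deterministic.

```python
class SimpleUnionFind:
    """Minimal disjoint-set with path compression (illustrative only)."""

    def __init__(self, items):
        # Every item starts as the root of its own single-element cluster.
        self.parent = {item: item for item in items}

    def find(self, item):
        # Walk up to the root of the cluster.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the smaller root so cluster ids are deterministic.
            low, high = sorted((root_a, root_b))
            self.parent[high] = low

    def clusters(self):
        # Group every item under the root of its cluster.
        groups = {}
        for item in self.parent:
            groups.setdefault(self.find(item), []).append(item)
        return groups

uf = SimpleUnionFind([1, 2, 3, 4, 5])
uf.union(1, 2)
uf.union(2, 3)
print(uf.find(1) == uf.find(3))  # True
print(uf.clusters())             # {1: [1, 2, 3], 4: [4], 5: [5]}
```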

Golden Records

from c1v_id import build_golden_records, SurvivorshipRule

rules = {
    "email": SurvivorshipRule.MOST_RECENT,
    "address": SurvivorshipRule.LONGEST,
    "first": SurvivorshipRule.FIRST_NON_NULL,
}

# df (source records) and clusters (output of the clustering step) are assumed to exist already
golden = build_golden_records(df, clusters, rules, source_priority=["crm", "web"])
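
To show what survivorship rules of this kind could mean in practice, here is a small standard-library sketch that merges one cluster into a single record. It is not the library's implementation: the helper functions, the `updated_at` field used to judge recency, and the sample data are all hypothetical, and source priority is not modeled.

```python
def most_recent(rows, field):
    """Value from the newest row that has the field (ISO dates sort as text)."""
    candidates = [r for r in rows if r.get(field)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["updated_at"])[field]

def longest(rows, field):
    """Longest non-empty value, on the assumption that longer is more complete."""
    values = [r[field] for r in rows if r.get(field)]
    return max(values, key=len) if values else None

def first_non_null(rows, field):
    """First non-empty value in row order."""
    return next((r[field] for r in rows if r.get(field)), None)

# Field-to-rule mapping that mirrors the rules dictionary above.
RULES = {"email": most_recent, "address": longest, "first": first_non_null}

# Hypothetical cluster of two duplicate records.
cluster = [
    {"email": "old@example.com", "address": "1 Main St",
     "first": None, "updated_at": "2023-01-05"},
    {"email": "new@example.com", "address": "1 Main Street, Apt 4",
     "first": "Pat", "updated_at": "2024-03-10"},
]

golden_record = {field: rule(cluster, field) for field, rule in RULES.items()}
print(golden_record)
# {'email': 'new@example.com', 'address': '1 Main Street, Apt 4', 'first': 'Pat'}
```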

License

MIT