How Anglerfish Works

    We audit training datasets for compliance risk without ingesting raw data. Vector representations of your data are compared against curated corpora to produce a reproducible audit trail.

    Your Training Data (input)
    training_corpus.jsonl
    scraped_articles.txt
    image_captions.csv
    Potential copyright risk
    Anglerfish
    Compliance Report (output)
    Documents scanned: 847,293
    Matches found: 23
    Risk level: Low
    Audit ID: aud_7x9k2m4p
    Verified
    Rights Cleared
    Ready

    The Audit Process

    Four steps from raw data to compliance-grade documentation.

    01 Scope

    Define your dataset: domain, size, and intended use. We then determine which reference indices are relevant and which risks are plausible.

    02 Vectorise

    Data is processed locally. Only vectors leave your system—never raw text or proprietary content.
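
    As a rough illustration of the idea, the sketch below turns text into a fixed-size vector on the local machine and builds a payload that contains the vector but not the text. It is not Anglerfish's actual vectoriser: the feature-hashing scheme, the 64-dimension size, and the names `vectorise` and `build_payload` are all assumptions chosen to keep the example self-contained.

```python
import hashlib
import math
import re

DIM = 64  # hypothetical vector size, chosen only for this example


def vectorise(text: str, dim: int = DIM) -> list[float]:
    """Map text to a unit-length vector using feature hashing of word tokens.

    This is a stand-in for a real embedding model.
    """
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0       # signed hashing
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_payload(doc_id: str, text: str) -> dict:
    """Build what would leave the local system: an ID and a vector, no raw text."""
    vector = vectorise(text)
    return {"doc_id": doc_id, "dim": len(vector), "vector": vector}
```

    The point of the design is that `build_payload` never copies `text` into its return value, so anything sent onward is limited to the document ID and numbers derived from the content.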

    03 Analyse

    We compare vectorised data against public and custom corpora using tested semantic similarity techniques, then apply proprietary analysis to prioritise meaningful risk.
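
    The comparison step can be pictured with a minimal cosine-similarity sketch. It covers only the generic similarity part, not the proprietary prioritisation, and the function names, the brute-force pairwise loop, and the 0.9 threshold are assumptions made for illustration. A match here is a signal for human review, not a finding of infringement.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_matches(dataset: dict, reference: dict, threshold: float = 0.9) -> list:
    """Return (doc_id, reference_id, score) for pairs at or above the threshold.

    Both arguments map an ID to a vector. The threshold is a hypothetical
    default, not a value taken from Anglerfish.
    """
    matches = []
    for doc_id, doc_vec in dataset.items():
        for ref_id, ref_vec in reference.items():
            score = cosine_similarity(doc_vec, ref_vec)
            if score >= threshold:
                matches.append((doc_id, ref_id, score))
    # Highest-scoring pairs first, so reviewers see the strongest signals first.
    matches.sort(key=lambda m: m[2], reverse=True)
    return matches
```

    At the scale of hundreds of thousands of documents, a pairwise loop like this would normally be replaced by an approximate nearest-neighbour index; the loop is kept here only because it makes the logic easy to read.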

    04 Report

    Receive a reproducible compliance pack with provenance, timestamps, and opt-out signals.
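
    One way to make a report reproducible is to serialise it canonically and attach a digest, so the same inputs always yield the same fingerprint. The sketch below assumes a simple JSON layout; the field names, the `build_report` function, and the use of SHA-256 are hypothetical and do not describe the actual format of an Anglerfish compliance pack. Opt-out signals are not modelled separately, although each match record could carry one as an extra field.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_report(audit_id, documents_scanned, matches, generated_at=None):
    """Assemble a report dict and fingerprint it with a SHA-256 digest.

    `matches` is a list of dicts with at least "doc_id" and "reference_id".
    All field names are illustrative assumptions.
    """
    if generated_at is None:
        generated_at = datetime.now(timezone.utc)
    body = {
        "audit_id": audit_id,
        "generated_at": generated_at.isoformat(),
        "documents_scanned": documents_scanned,
        "matches_found": len(matches),
        # Sort so the digest does not depend on the order matches were found in.
        "matches": sorted(matches, key=lambda m: (m["doc_id"], m["reference_id"])),
    }
    # Canonical form: sorted keys and fixed separators give stable bytes.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return body
```

    Passing an explicit `generated_at` lets a rerun with the same inputs reproduce the same digest, which is what makes the report checkable after the fact.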

    Similarity ≠ infringement · Similarity = signal that requires review

    What Anglerfish is

    A compliance and risk-reduction layer

    A due-diligence system for training data

    Infrastructure for auditability and provenance

    Ready to run your first audit?

    Get started with our guided onboarding process.