Check your software.

Find the software you use. See if it’s approved.

1,190

Must be removed

1,708

Requires a license

600

Check with Sysadmin

33

Use browser tab

282

No action required

Sysadmin published the list.
The full list is searchable.

Why this policy exists

Unapproved software increases organizational risk. Compromised endpoints affect everyone on the network. Removing off-policy installs is necessary to keep systems within IT control.

When things go wrong

SolarWinds

2020: Orion was compromised at the source and ran with full network access across thousands of organizations.

Log4Shell

2021: Log4j was embedded in thousands of products. Most organizations had no idea they were running it.

Samsung + ChatGPT

2023: Engineers leaked source code through ChatGPT prompts. No AI tool policy, no visibility into what was running.

Pipeline & methodology

Full audit pipeline — figures below are computed from the dataset this page loads.

01 Source Enterprise install export — one row per detected install (software name + manufacturer) —

02 Normalize Deterministic rules — version tails, whitespace, legal-entity suffixes, stable vendor keys

03 Deduplicate Stable group_key per product + publisher · merged rows · install counts retained —

04 Enrich A Gemini 2.0 Flash — structured identity pass (vendor, license, category, canonical product URL) —

05 Enrich B Same model — policy, risk, user-facing guidance, alternatives, calibrated confidence —

06 Validate Enum + schema validation · invalid values coerced or dropped · sysadmin rulings merged (authoritative) —

07 Confidence 0.80 threshold applied · —

08 Ship Merged artifact · slim JSON for this UI · provenance preserved in repo —

Step 1 — Ingestion & normalization

Source of truth: Commercial Software to be removed.xlsx — IT Operations export, — rows. Each row is a single install record (software name + manufacturer). Versions are embedded in the name string; there is no separate version column and no per-machine identifier — this dataset is scoped to what is installed, not where.

Normalization (deterministic, repeatable):

Case-insensitive matching; internal whitespace collapsed
Version and build tails stripped with explicit regex (trailing numeric patterns and parenthetical suffixes)
Manufacturer legal suffixes normalized (Inc., Incorporated, Corp., Ltd., LLC, GmbH, etc.)
group_key = normalized product name + || + normalized vendor key — same key drives deduplication, enrichment cache, and merge
UI displays canonical vendor labels for major publishers for scanability; the underlying key stays stable for engineering

Scoped tradeoff: Aggressive version stripping can fold distinct product lines into one group when the source name collapses to the same base (e.g. major language versions). That is intentional for an org-wide inventory view; line-item policy belongs in IT systems of record, not in this rollup.
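
The normalization rules above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the regex patterns, the suffix list, and the function names normalize_product, normalize_vendor, and group_key are assumptions made for the example. Only the "||" separator and the overall rule order come from the description above.

```python
import re

# Hypothetical patterns: the pipeline's real regexes are not shown on this page.
_PAREN_SUFFIX = re.compile(r"\s*\([^)]*\)\s*$")              # e.g. " (64-bit)"
_VERSION_TAIL = re.compile(r"\s+v?\d+(?:\.\d+)*\s*$", re.I)  # e.g. " 3.11.4", " v2"
_LEGAL_SUFFIX = re.compile(
    r"[\s,]+(?:inc|incorporated|corp|corporation|ltd|llc|gmbh)\.?$", re.I
)


def _strip_until_stable(text: str, patterns) -> str:
    """Apply each pattern repeatedly until the string stops changing."""
    while True:
        stripped = text
        for pattern in patterns:
            stripped = pattern.sub("", stripped)
        if stripped == text:
            return text
        text = stripped


def normalize_product(name: str) -> str:
    # Collapse internal whitespace and lowercase for case-insensitive matching.
    text = " ".join(name.split()).lower()
    return _strip_until_stable(text, (_PAREN_SUFFIX, _VERSION_TAIL))


def normalize_vendor(manufacturer: str) -> str:
    text = " ".join(manufacturer.split()).lower()
    return _strip_until_stable(text, (_LEGAL_SUFFIX,))


def group_key(name: str, manufacturer: str) -> str:
    # Same key shape as described above: product name + "||" + vendor key.
    return f"{normalize_product(name)}||{normalize_vendor(manufacturer)}"
```

With these assumed patterns, group_key("Python 2.7.18", "Example Foundation") and group_key("Python 3.11.4 (64-bit)", "example foundation") produce the same key, which is the folding behavior the scoped tradeoff describes.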

Step 2 — Deduplication

All — rows roll up into groups on group_key. Colliding rows merge; install counts aggregate so prevalence is visible. Where the source showed multiple manufacturer strings for one product, the pipeline picks a consistent display string (longest form wins at ingest).

Shape of the data: — distinct product groups. Collapse: — of rows folded into an existing group (— rows merged). Largest single group: — install rows. Single-install products: — of — groups (—%) — typical long tail of one-off tools.

Quality bar: Duplicate keys from messy manufacturer strings are possible; spot-checks during build suggested a sub-1% phantom duplicate rate. If two products collide, they are usually near-duplicates in practice.
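
A rough sketch of the rollup, assuming each source row has already been given its group_key. The Group dataclass, the function names, and the summary field names are illustrative assumptions. The "longest manufacturer string wins" rule and the aggregated install counts follow the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    group_key: str
    display_manufacturer: str
    install_count: int = 0
    source_names: list = field(default_factory=list)


def deduplicate(rows):
    """rows: iterable of (group_key, raw_name, raw_manufacturer) tuples."""
    groups = {}
    for key, raw_name, raw_manufacturer in rows:
        group = groups.get(key)
        if group is None:
            group = groups[key] = Group(key, raw_manufacturer)
        elif len(raw_manufacturer) > len(group.display_manufacturer):
            # Longest manufacturer string wins as the display form.
            group.display_manufacturer = raw_manufacturer
        group.install_count += 1  # one source row is one install record
        group.source_names.append(raw_name)
    return groups


def summarize(groups):
    """Compute the shape-of-data figures from the grouped result."""
    counts = [g.install_count for g in groups.values()]
    total_rows = sum(counts)
    return {
        "rows": total_rows,
        "groups": len(counts),
        "merged_rows": total_rows - len(counts),
        "largest_group": max(counts, default=0),
        "single_install_groups": sum(1 for c in counts if c == 1),
    }
```

The single-install share is then single_install_groups divided by groups.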

Step 3 — Identity enrichment (Pass 1)

— structured Gemini 2.0 Flash calls — one per product group — to fill a fixed identity schema the source spreadsheet does not carry:

Canonical vendor — aligned to the real publisher, not the noisy string from the export
License type — closed enum: Commercial / Open Source / Free / Unknown
Category — 11-way taxonomy so the catalog is filterable and comparable
Product URL — primary product page where a user can verify what the title refers to

Every response is keyed by group_key and written to data/enrichment_cache.jsonl (append-only). The cache is the audit trail: reruns are idempotent — completed groups are never re-sent unless you intentionally invalidate the cache.

Operational limits: Model knowledge has a cutoff; very new or renamed SKUs may be mislabeled until refreshed. URLs are not live-checked in batch. Categories are mutually exclusive — hybrid products get the closest fit. Those limits are ordinary for LLM-assisted enrichment; the UI still surfaces source row text for audit.
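
The append-only cache behavior can be sketched as below. This is an illustration under assumptions: the record layout (a group_key field plus whatever the enrichment call returns) and the names load_cache and enrich_all are hypothetical, and enrich stands in for the actual model call, which is not shown on this page.

```python
import json
from pathlib import Path
from typing import Callable


def load_cache(path: Path) -> dict:
    """Read the JSONL cache into a dict keyed by group_key."""
    cache = {}
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    record = json.loads(line)
                    cache[record["group_key"]] = record
    return cache


def enrich_all(group_keys, enrich: Callable[[str], dict], path: Path) -> dict:
    """Call enrich() only for groups missing from the cache; append results."""
    cache = load_cache(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:  # append-only
        for key in group_keys:
            if key in cache:
                continue  # completed groups are never re-sent
            record = {"group_key": key, **enrich(key)}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            handle.flush()  # keep finished work if the run is interrupted
            cache[key] = record
    return cache
```

In this sketch, invalidating the cache means deleting the file or removing the relevant lines before a rerun.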

Step 4 — Policy enrichment (Pass 2)

Second structured pass: — calls on the same model stack, separate prompts — policy and risk are intentionally isolated from identity so the model cannot shortcut from a vendor name to a verdict.

Policy status — approved / remove / license_required / use_web_version / review_with_sysadmin
Risk tier — low through critical (misuse / compromise exposure framing)
Description & recommended action — employee-readable, one screen each
Data sensitivity — what classes of organizational data the product can touch
Alternatives — Zoho or no-cost options where they exist
pass2_confidence — model-reported certainty on the policy label (continuous 0–1), used only for disclosure bands — not as ground truth

Splitting passes is a deliberate architecture choice: combined identity+policy prompts empirically produced anchoring bias toward the first label. Two narrow contracts per group produce more stable policy output than one kitchen-sink prompt.

Where AI stops: This pass infers posture from public product knowledge — it does not read internal IT policy documents. risk_tier is judgment, not a formal risk register score. Treat pass2_confidence as self-assessment, not verification.
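
The pipeline overview mentions enum and schema validation with invalid values coerced or dropped. A minimal sketch of that idea for Pass 2 output follows. The status values are the ones listed above. The fallback to review_with_sysadmin for unknown statuses, the dropping of out-of-range confidence values, and the function name validate_pass2 are assumptions for illustration.

```python
POLICY_STATUSES = {
    "approved",
    "remove",
    "license_required",
    "use_web_version",
    "review_with_sysadmin",
}
# Assumption: unknown statuses are coerced to the most conservative label.
FALLBACK_STATUS = "review_with_sysadmin"


def validate_pass2(raw: dict) -> dict:
    """Return only fields that pass validation; coerce or drop the rest."""
    clean = {}

    # Coerce near-misses such as "License Required" to the enum spelling.
    status = str(raw.get("policy_status", "")).strip().lower()
    status = status.replace(" ", "_").replace("-", "_")
    clean["policy_status"] = status if status in POLICY_STATUSES else FALLBACK_STATUS

    # Drop confidence values that are missing, non-numeric, or outside 0..1.
    try:
        confidence = float(raw.get("pass2_confidence"))
    except (TypeError, ValueError):
        confidence = None
    if confidence is not None and 0.0 <= confidence <= 1.0:
        clean["pass2_confidence"] = confidence

    return clean
```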

Policy distribution in this dataset

Status	What it means	Count
Remove	Out of policy or high risk — or explicit sysadmin removal	—
License required	Commercial title — no evidence of a company license on record	—
Check sysadmin	Context-dependent — or model confidence below the disclosure threshold	—
Use web version	Browser / SaaS path acceptable; local install is not	—
Approved	Cleared by sysadmin ruling or high-confidence model pass	—


Step 5 — Authoritative overrides

Sysadmin Community rulings (sourced from Kirijan J's authoritative post for this audit) are codified in hardcoded_rulings.json — pattern + explicit field values. They apply after both enrichment passes, on normalized product names (word-boundary safe).

Overrides win: any field set by a ruling replaces model output for that field. Affected rows carry data_source: sysadmin_ruling. Those rules expand to — rows in this dataset when multiple install names normalize to the same product.

Only — rows carry organizational authority from that channel. Everything else is model-assisted classification with the safeguards described above.
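
A sketch of how a ruling might be applied, assuming each entry in hardcoded_rulings.json has a pattern string and a dict of field values. That layout, the normalized_name field, and the function name apply_rulings are assumptions. The word-boundary matching, field-level replacement, and the data_source marker follow the description above.

```python
import re


def apply_rulings(record: dict, rulings: list) -> dict:
    """Overlay sysadmin rulings on a model-enriched record.

    Assumed ruling layout: {"pattern": "<text>", "fields": {<field>: <value>}}.
    """
    result = dict(record)
    for ruling in rulings:
        # \b keeps "acme sync" from matching inside "acme syncer".
        # Note: \b only behaves this way when the pattern starts and ends
        # with a word character.
        pattern = re.compile(rf"\b{re.escape(ruling['pattern'])}\b", re.IGNORECASE)
        if pattern.search(record["normalized_name"]):
            result.update(ruling["fields"])  # ruling fields replace model output
            result["data_source"] = "sysadmin_ruling"
    return result
```

Fields a ruling does not set keep their model-assigned values, which matches the per-field override rule.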

Step 6 — Merge, slim, ship

build_v2_data.py merges deduplicated groups, Pass 1, Pass 2, and overrides into data/software_audit_data_v2.json. Precedence is strict: sysadmin override > Pass 2 > Pass 1 > raw export. The client bundle is then slimmed (field provenance and heavy raw rows stripped) for fast load — full lineage stays in the repository.
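
The precedence order can be illustrated with a small layered merge. This is a sketch, not the contents of build_v2_data.py: the function names, the provenance labels, and the choice to treat None as "not set" are assumptions.

```python
# Lowest precedence first, so later layers overwrite earlier ones.
LAYER_ORDER = ("raw_export", "pass1", "pass2", "sysadmin_override")


def merge_group(raw: dict, pass1: dict, pass2: dict, override: dict):
    """Merge one group's layers; return (merged fields, per-field provenance)."""
    merged, provenance = {}, {}
    for source_name, layer in zip(LAYER_ORDER, (raw, pass1, pass2, override)):
        for field_name, value in layer.items():
            if value is None:
                continue  # assumption: None means the layer has no opinion
            merged[field_name] = value
            provenance[field_name] = source_name
    return merged, provenance


def slim(merged: dict, keep) -> dict:
    """Keep only the fields the client bundle needs."""
    return {name: merged[name] for name in keep if name in merged}
```

Provenance is returned separately so it can stay in the repository while the slimmed dict ships to the client.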

Final dataset composition

Confidence & disclosure

Pass 2 asks the model for a calibrated pass2_confidence score (0.0–1.0) on its own policy_status label — explicit prompt, same structure every row. That score is used for transparency, not as a quality guarantee.

0.80 cutoff — sampled manual review on a random slice showed that below 0.80, human reviewers disagreed with the model more often; above it, agreement was high. The UI labels rows below the line ai_inferred so employees know to double-check — including all — such rows in this dataset. Nothing is hidden; the banding is disclosure.

Confidence is not verification. It does not prove correctness, policy alignment, or that the product still exists. It only measures how assertive the model was about its own label.
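
The disclosure banding reduces to a simple comparison, sketched below. The 0.80 threshold and the ai_inferred and sysadmin_ruling labels come from this page. The label for rows at or above the threshold, the handling of a score of exactly 0.80, and the treatment of a missing score as below the line are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.80


def disclosure_label(record: dict) -> str:
    """Pick the disclosure band the UI would show for one row."""
    if record.get("data_source") == "sysadmin_ruling":
        return "sysadmin_ruling"  # authoritative, not banded by confidence
    confidence = record.get("pass2_confidence")
    # Assumption: a missing score is disclosed the same way as a low one.
    if confidence is None or confidence < CONFIDENCE_THRESHOLD:
        return "ai_inferred"
    return "model_confident"  # hypothetical label for rows at or above 0.80
```

The function only chooses a label; it never filters rows, consistent with the statement that nothing is hidden.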

Decision support, not a policy system of record.

Classifications and recommendations are model-assisted unless a row is marked sysadmin_ruling. Use this tool to understand what is on the network and to drive conversations — not as the sole basis for disciplinary or compliance decisions without human review.

When in doubt, ask Sysadmin. They own the authoritative interpretation of corporate policy.
