How Conformly Works - Conformly.ai

Why this document exists

If you are an engineer evaluating Conformly, your first question is not “what does it do” — you can see that in fifteen minutes. Your first question is “can I trust what it says, and can I defend it in front of my customer?” This document answers that question. It is shorter than the formal Methodology guide and longer than the marketing copy. It explains the core ideas in plain language, shows you the trust mechanisms, and tells you honestly where the limits are. By the end you should know whether Conformly is a tool you would put your name next to.

The core idea in one paragraph

Conformly reads automotive engineering documents the way a senior compliance engineer would. It identifies the standards that apply (ASPICE, ISO 26262, ISO 21434, ISO 9001, IEC 61508), evaluates the document against the relevant clauses, and produces a structured gap report with severity ratings, suggested fixes, and verbatim evidence anchors that point to the exact passage in the document each finding refers to. Every score is reproducible. Every finding is auditable. Every analysis is logged.

The five things that make this trustworthy

Most AI compliance tools fail the trust test in one of five ways. We built Conformly with the assumption that we would be evaluated on each of them.

1. Findings are anchored to verbatim text

Every gap Conformly produces includes an evidence quote: a short verbatim span of five to twenty words, copied directly from your document. This is the single most important trust mechanism in the product. When the AI evaluates a clause, we explicitly instruct it to find the location in your document its reasoning is based on, copy the exact words, and never paraphrase. If it cannot find a confident anchor, it returns no quote rather than fabricating one. The corresponding finding is then shown without a precise highlight, so you know the AI was reasoning about a section it couldn’t pinpoint and you are not misled by a highlight that points at the wrong passage. You can verify any finding by searching your document for the quoted span. If the words are there, the quote is genuine and you can judge the finding against that exact passage. If they’re not, you have caught a hallucination, and we want to know about it, because the system is built to make those visible rather than hide them.
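The manual check described above can be sketched in a few lines. This is a minimal illustration, assuming the document is available as plain text; the function names and the whitespace normalization are assumptions made for the example, not Conformly's actual code.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wraps in the source do not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def quote_is_verbatim(document_text: str, quote: str,
                      min_words: int = 5, max_words: int = 20) -> bool:
    """Return True if the quote has 5 to 20 words and appears word for word in the document."""
    normalized_quote = normalize(quote)
    word_count = len(normalized_quote.split())
    if not (min_words <= word_count <= max_words):
        return False
    # Exact substring match: a paraphrase, however close, does not count.
    return normalized_quote in normalize(document_text)
```

A quote that fails this check is either a paraphrase or a fabrication, which is exactly the case the evidence anchor is meant to expose.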

2. Confidence is surfaced, not buried

Every finding carries a confidence score from the AI itself, reflecting how sure it is about its own conclusion. Findings with confidence below 0.6 are automatically routed through a second evaluation pass with expanded evidence — the system tries harder before it tells you anything. Findings that are still low-confidence after the second pass are shown with a red confidence badge so you know to review them carefully. Findings with high confidence get a green badge. The badge sits next to every finding in the explorer; you cannot miss it. This is the opposite of how most AI tools work. The default behavior in the industry is to show every output with the same visual weight, hiding the model’s uncertainty. We do the opposite: the output volume is the same, but the certainty signal is loud.
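The routing rule above can be sketched as follows. The 0.6 threshold comes from the text; the `Finding` class, the `route` function, and the `second_pass` callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

LOW_CONFIDENCE_THRESHOLD = 0.6  # threshold stated in the text


@dataclass
class Finding:
    clause_id: str
    confidence: float
    flagged_low_confidence: bool = False


def route(finding: Finding, second_pass: Callable[[Finding], Finding]) -> Finding:
    """Re-evaluate once if confidence is below the threshold, then flag if still low."""
    if finding.confidence >= LOW_CONFIDENCE_THRESHOLD:
        return finding
    # second_pass stands in for the re-evaluation with expanded evidence.
    retried = second_pass(finding)
    retried.flagged_low_confidence = retried.confidence < LOW_CONFIDENCE_THRESHOLD
    return retried
```

In this sketch a finding is re-evaluated at most once, and the flag is what the UI would turn into the low-confidence badge.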

3. The same input always produces the same output (within a documented tolerance)

Every analysis Conformly runs is pinned to three reproducibility anchors:

Model version — the exact AI model SKU used to evaluate
Prompt version — a versioned identifier for the evaluation prompt template
Standards versions — the version of each standard definition file used

These three pins are stored on every analysis row in the database. Six months from now, an auditor can re-run any historical analysis at the same pins and confirm the result. The acceptance criterion is that re-runs match the historical compliance score within ±2 points. Anything larger triggers an alert in our monitoring. This matters because the alternative — “we just ran the AI and these are the findings” — is not defensible in a compliance context. An assessor will ask “could you reproduce this?” and the answer needs to be yes.
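A minimal sketch of the pins and the re-run acceptance check, under stated assumptions: the three fields mirror the list above and the tolerance of 2 points comes from the text, but the class name, field names, and version strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

SCORE_TOLERANCE = 2  # points, per the acceptance criterion in the text


@dataclass(frozen=True)
class AnalysisPins:
    model_version: str
    prompt_version: str
    # (standard name, definition file version) pairs
    standards_versions: Tuple[Tuple[str, str], ...]


def same_pins(historical: AnalysisPins, rerun: AnalysisPins) -> bool:
    """A re-run is only comparable if all three pins are identical."""
    return historical == rerun


def rerun_matches(historical_score: int, rerun_score: int,
                  tolerance: int = SCORE_TOLERANCE) -> bool:
    """Return True if the re-run score is within the accepted drift."""
    return abs(historical_score - rerun_score) <= tolerance
```

For example, a historical score of 82 re-run at 84 passes, while 82 re-run at 85 would be the kind of drift that raises an alert.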

4. The scoring math is on one page

All scoring lives in a single source of truth: pure functions, no hidden weighting matrix, fully unit-tested. The formula is intentionally simple:

A document starts at 100. Each finding subtracts a fixed penalty based on severity: critical −15, high −8, medium −4, low −2. The score is capped between 0 and 100.
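The formula is small enough to write out directly. The penalties and the 0 to 100 range are taken from the text; the function and constant names are illustrative.

```python
from typing import Iterable

# Fixed penalty per finding, by severity, as stated in the text.
PENALTIES = {"critical": 15, "high": 8, "medium": 4, "low": 2}


def compliance_score(severities: Iterable[str]) -> int:
    """Start at 100, subtract a fixed penalty per finding, clamp to 0..100."""
    total_penalty = sum(PENALTIES[severity] for severity in severities)
    return max(0, min(100, 100 - total_penalty))
```

As a worked example, one critical, two high, three medium, and one low finding subtract 15 + 16 + 12 + 2 = 45 points, giving a score of 55.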

A reviewer can re-derive any Conformly score with a calculator and the gap list. There is no opacity. There is no proprietary weighting we can’t explain. If you disagree with a score, the disagreement is about the severity of a specific finding, not about a black box.

For ASPICE evaluations, we additionally compute a Capability Level following ISO/IEC 33020. We use the conservative aggregation rule: a workspace’s capability level is the lowest capability achieved by any of its evaluated processes, not the average. This matches OEM assessor practice and prevents a single weak process from being masked.
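The conservative aggregation rule amounts to taking a minimum. A minimal sketch, assuming each evaluated process already has an integer capability level; the function name and the example levels are hypothetical.

```python
from typing import Dict


def workspace_capability_level(process_levels: Dict[str, int]) -> int:
    """Return the lowest capability level among the evaluated processes."""
    if not process_levels:
        raise ValueError("no evaluated processes")
    return min(process_levels.values())
```

With levels of 2, 3, and 1 for three processes, the workspace is rated at level 1, whereas an average would report 2 and hide the weak process.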

5. Every state change is logged in an append-only audit trail

Conformly’s audit log lives in a separate database schema with insert-only permissions — application code cannot update or delete rows in the audit table even if it wanted to. The constraint is enforced at the database level, not at the application layer. The audit log records who did what, when, and from where. Authentication events. Document uploads (with the SHA-256 of the bytes). Analysis lifecycles (started, completed, failed, ran in degraded mode). Every gap status change with from-and-to values. Every report export with format and row count. Every administrative action. This is the answer to the question every external auditor will eventually ask: “show me the chain of custody for this finding from creation to closure.” The answer is a single SQL query against the audit log.
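To make the append-only idea concrete, here is a small self-contained sketch. It uses SQLite triggers purely as a stand-in for database-level enforcement, because SQLite has no permission system; the text describes insert-only permissions on a separate schema, which this does not reproduce. The table layout, column names, and function names are all assumptions.

```python
import sqlite3


def open_audit_db() -> sqlite3.Connection:
    """Create an in-memory audit table that rejects UPDATE and DELETE."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE audit_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            occurred_at TEXT NOT NULL DEFAULT (datetime('now')),
            actor TEXT NOT NULL,
            action TEXT NOT NULL,
            entity_id TEXT NOT NULL,
            from_value TEXT,
            to_value TEXT
        );
        CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
        CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
    """)
    return conn


def record(conn, actor, action, entity_id, from_value=None, to_value=None):
    """Append one event; this is the only write the table accepts."""
    conn.execute(
        "INSERT INTO audit_log (actor, action, entity_id, from_value, to_value) "
        "VALUES (?, ?, ?, ?, ?)",
        (actor, action, entity_id, from_value, to_value),
    )
    conn.commit()


def chain_of_custody(conn, entity_id):
    """Return every recorded event for one entity, oldest first."""
    return conn.execute(
        "SELECT actor, action, from_value, to_value FROM audit_log "
        "WHERE entity_id = ? ORDER BY id",
        (entity_id,),
    ).fetchall()
```

The design point carries over regardless of the database: the rejection happens below the application, so a bug or a compromised service account cannot rewrite history.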

The pipeline, in plain language

When you upload a document and start an analysis, this is what happens behind the scenes. We omit the implementation details that would only matter to an engineer writing the next version; what’s here is what determines whether the output is trustworthy.

Step 1: The document is parsed. A vision-based document reader extracts the text, structure, tables, and page numbers. If the primary parser fails for any reason, the system falls through to two simpler text-based parsers. We always know which parser was used and we report it on the result, so a degraded parse is visible rather than hidden.

Step 2: The relevant standards are loaded. Definitions for ASPICE, ISO 26262, ISO 21434, ISO 9001, and IEC 61508 are pre-parsed and pinned to specific versions. Each clause has expected key practices, evidence keywords, and a description of what compliance looks like.

Step 3: The document is indexed for retrieval. The text is split into chunks and embedded into a vector index so the system can pull the most relevant passages for any given clause. If the embedding service is unavailable, the system runs in degraded mode, and you see a warning banner on every page of the report so you cannot accidentally submit a low-confidence analysis to a customer.

Step 4: Each clause is evaluated. For every clause in the standards you selected, the AI receives the clause description, the expected practices, and the most relevant chunks from your document. It returns a status (compliant, partial, non-compliant), a confidence score, the evidence it found, and any gaps, each with a verbatim quote anchor.

Step 5: Low-confidence findings are re-evaluated. Any clause that came back with confidence below 0.6 is automatically run a second time with expanded evidence. The system tries harder before settling on an uncertain finding. If the second pass is still low-confidence, the finding is flagged in the UI so you know to look closely.

Step 6: Cross-clause checks run. ASPICE has process dependencies: SYS.3 cannot be compliant if SYS.2 is non-compliant. The cross-validation step catches structural inconsistencies a clause-by-clause review would miss and flags them for review.

Step 7: For high-stakes clauses, specialist agents review the result. A safety auditor agent re-checks safety-related findings against ISO 26262 part numbers. A security auditor agent does the same for ISO 21434. A supervisor agent resolves conflicts between primary and specialist evaluations. This step runs only when explicitly requested, because it costs more compute.

Step 8: The report is assembled. Findings are aggregated, scores are computed using the formula above, recommendations are sorted by priority, and the audit log is written. The result is persisted and shown to you.
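The cross-clause check can be sketched as a lookup over a dependency map. Only the SYS.3 on SYS.2 dependency comes from the text; the map structure, function name, and status strings as dictionary values are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical dependency map: each process lists the processes it depends on.
# Only the SYS.3 -> SYS.2 entry is taken from the text.
DEPENDS_ON: Dict[str, List[str]] = {"SYS.3": ["SYS.2"]}


def cross_clause_conflicts(statuses: Dict[str, str],
                           depends_on: Dict[str, List[str]] = DEPENDS_ON
                           ) -> List[Tuple[str, str]]:
    """Return (process, prerequisite) pairs where a compliant process
    depends on a non-compliant prerequisite."""
    conflicts = []
    for process, prerequisites in depends_on.items():
        if statuses.get(process) != "compliant":
            continue
        for prerequisite in prerequisites:
            if statuses.get(prerequisite) == "non-compliant":
                conflicts.append((process, prerequisite))
    return conflicts
```

Each returned pair is a structural inconsistency to flag for review; the sketch does not change either status on its own.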

What we tell you when something went wrong

The pipeline above has fallback paths at almost every step. We do not pretend they don’t exist. When any fallback engages, you see a quality banner on every page of the report identifying which path was used. The banner explicitly says things like:

This analysis ran in degraded mode. The vector retrieval service was unavailable; the system used a simplified retrieval approach. Findings should be considered indicative rather than definitive. Re-run when service is restored.

You cannot accidentally submit a degraded report to a customer or auditor. The banner is unmissable, the badge persists in exports, and the audit log records that the analysis ran in fallback mode. This is the discipline that turns “AI compliance” from a marketing claim into something you can defend.
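One way to keep a fallback visible is to return the name of the path that actually ran alongside the result, and derive the banner from that. The sketch below is an assumption about how this could be structured, not a description of Conformly's code; the function names, path names, and banner wording are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def run_with_fallback(data: bytes,
                      paths: List[Tuple[str, Callable[[bytes], str]]]
                      ) -> Tuple[str, str, bool]:
    """Try each path in order; return (result, path_name, degraded).

    degraded is True whenever anything other than the first path succeeded.
    """
    for index, (name, handler) in enumerate(paths):
        try:
            return handler(data), name, index > 0
        except Exception:
            continue  # fall through to the next, simpler path
    raise RuntimeError("all paths failed")


def quality_banner(path_name: str, degraded: bool) -> Optional[str]:
    """Return banner text for a degraded run, or None for a normal run."""
    if not degraded:
        return None
    return f"This analysis ran in degraded mode. Fallback path used: {path_name}."
```

Because the degraded flag travels with the result, every later stage (report pages, exports, audit log) can read it rather than having to infer it.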

What we are honest about not having

Trust starts with naming the limits.

The AI is not deterministic. Even with pinned models and pinned prompts and a low temperature setting, the same input can produce slightly different outputs across runs. We accept ±2 points of drift on the final compliance score; larger drift triggers an alert. If your team needs strict bit-identical reproducibility for a regulated submission, we are happy to talk about how to set that up, typically by running the analysis once and pinning the result.

The evidence-quote anchor depends on the AI copying from your document correctly. Our target hit rate is at least 80% of findings landing on a real verbatim quote. Below that we tighten the prompt. When the AI cannot find a confident anchor, the finding is shown without a precise highlight rather than with a fabricated one. You can always tell the difference.

ASPICE Capability Level is computed conservatively. A process is rated at the lowest capability level any of its indicators achieved. This is the same approach OEM assessors use and it is by design: averaging would mask a single weak indicator. Some teams find this surprising the first time they see it; the methodology guide explains the math.

Cross-standard knowledge graph coverage is highest for ASPICE and ISO 26262. The mappings between ASPICE process IDs and ISO 26262 clause numbers are hand-curated and maintained by our team. ISO 9001 cross-references are sparser. We add new mappings as data, not as code, so coverage will improve over time.

We do not score legal or contractual compliance. Conformly evaluates technical and process compliance against published technical standards. It does not evaluate country-specific regulations, contractual obligations to a specific customer, or legal liability. Those dimensions require human judgment your team is responsible for.

Why this approach beats alternatives

Most AI compliance tools fail in one of two ways: by being too cautious (hedging everything and producing no actionable findings) or by being too confident (fabricating findings the engineer can’t defend). Either way, the engineer is left without something usable. Conformly takes a third approach: be confident when the evidence supports it, be visibly uncertain when it doesn’t, and never fabricate. The architecture is built around making trust signals load-bearing: confidence scores that affect re-evaluation logic, evidence anchors that are mandatory in the prompt, audit logs that the application cannot alter. The trust mechanisms are not bolted on at the end; they are wired into how the system works.

The result is a tool an engineer can put their name next to. It is not a tool that does the engineer’s job. It is a tool that does the boring 80% of the job in minutes, surfaces the 20% that needs human judgment, and gives the engineer a defensible audit trail for everything that happened along the way.
