Full Methodology

How the Bureau scores tools and classifies news

StackIndex scores AI tools for buyers choosing between them, producing a StackScore for each. It is not a voice-of-customer, sentiment, or review-aggregation platform.

This page covers the complete methodology in two parts. Jump to the section you need.

Part 1: StackScore Tools™ · Part 2: StackScore News™

Part 1 of 2

StackScore Tools™

Every AI tool is scored across 4 intelligence layers by a team of specialist agents. No single agent decides. Rank applies a fixed formula and Pulse audits every run before a score goes live.

The Formula

Rank is the only agent that writes the final stackscore. The weights below are fixed — no agent can override them.

stackscore = ROUND(
    operational_score    × 0.40   // 40%: Can it improve real workflows?
  + trust_score          × 0.25   // 25%: Can it be trusted operationally?
  + market_score         × 0.20   // 20%: Does it matter in the ecosystem?
  + infrastructure_score × 0.15   // 15%: Can it anchor a durable AI stack?
)
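
As a minimal sketch of the formula above, the composite can be computed as a weighted sum. The function and variable names are illustrative, and the rounding mode is an assumption (the page only says ROUND); weights are held as integer percentages to keep the arithmetic exact for whole-number inputs.

```python
import math

# Fixed weights from the formula above, as integer percentages.
WEIGHTS_PCT = {
    "operational": 40,
    "trust": 25,
    "market": 20,
    "infrastructure": 15,
}


def round_half_up(x: float) -> int:
    # Assumption: ROUND means "round half up"; the source does not say.
    return math.floor(x + 0.5)


def stackscore(dimension_scores: dict) -> int:
    """Weighted sum of the four dimension scores (each 0-100)."""
    weighted = sum(dimension_scores[name] * pct for name, pct in WEIGHTS_PCT.items())
    return round_half_up(weighted / 100)


example = {"operational": 80, "trust": 70, "market": 60, "infrastructure": 90}
print(stackscore(example))  # 32 + 17.5 + 12 + 13.5 = 75
```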

Operational: 40%

Does it work in real workflows?

Trust: 25%

Can you trust it with real data?

Market: 20%

Does it matter in the ecosystem?

Infrastructure: 15%

Can it anchor a durable stack?

News Momentum

The four dimensions above form the base StackScore — a stable rating that only changes when Rank re-evaluates a tool. On top of that, Sage applies a small, bounded momentum layer so the published score reflects what just happened in the news.

published_stackscore = CLAMP(base_stackscore + momentum, 0, 100)

momentum = CLAMP( ROUND(prev_momentum × 0.6 + impulse), −5, +5 )

impulse  = Σ  signed_impact(story → tool)   // over the last ~26h of news
           signed_impact ∈ [−2 … +2]         // −2 clearly bad … +2 major win

Each AI news story is classified for its direction on every tool it genuinely affects — a lawsuit, outage or pricing backlash pushes a tool down; a major launch, funding round or marquee customer pushes it up. Momentum is capped at ±5 points and decays ~40% per day, so a single story fades within a few days and can never override the underlying rating. On a quiet news day, momentum simply drifts back toward zero — movement is only ever shown when it is real. This is what powers the homepage Biggest Movers / Under Pressure board and the live ticker.
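
The momentum update can be sketched as follows. This is an illustration under stated assumptions, not the Bureau's implementation: the function names are hypothetical, ROUND is taken as round-half-up, and an out-of-range impact is rejected rather than clipped.

```python
import math


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def round_half_up(x: float) -> int:
    # Assumption: ROUND means "round half up"; the source does not say.
    return math.floor(x + 0.5)


def next_momentum(prev_momentum: int, signed_impacts: list) -> int:
    """One update of the momentum formula for a single tool."""
    for impact in signed_impacts:
        if not -2 <= impact <= 2:
            raise ValueError("signed_impact must lie in [-2, +2]")
    impulse = sum(signed_impacts)
    return clamp(round_half_up(prev_momentum * 0.6 + impulse), -5, 5)


def published_stackscore(base_stackscore: int, momentum: int) -> int:
    return clamp(base_stackscore + momentum, 0, 100)


m = next_momentum(0, [2, 1])   # two positive stories: impulse = 3
print(m, published_stackscore(72, m))
m = next_momentum(m, [])       # quiet day: 3 * 0.6 = 1.8, rounds to 2
print(m, published_stackscore(72, m))
```

One detail worth noting: if the stored value is the rounded integer, a momentum of ±1 maps to itself on a quiet day (0.6 rounds to 1), so a full return to zero depends on how prev_momentum is stored, which the formula does not specify.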

Confidence Formula

Every score has a confidence value (0–1). Low confidence = fewer sources, high variance, or missing data. Rank cannot inflate confidence — it can only cap it down.

base = average(operational_conf, trust_conf, market_conf, infra_conf)

penalties:
  −0.10  if ANY dimension evidence_count < 3
  −0.15  if score spread (max_dim − min_dim) > 35
  −0.05  if ANY dimension confidence < 0.60
  −0.08  if total evidence_count < 10

bonuses:
  +0.05  if ALL dimension evidence_count >= 8
  +0.03  if ALL dimension confidence >= 0.75

floor: 0.40   ceiling: 0.97   (cannot exceed 0.90 unless evidence_count ≥ 12)
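
The confidence rules above can be read as a single function. The sketch below is one interpretation: the input shape is hypothetical, and it assumes the "evidence_count ≥ 12" condition on the ceiling refers to total evidence across all four dimensions.

```python
def confidence(dimensions: list) -> float:
    """dimensions: one dict per dimension with 'score', 'conf', 'evidence_count'."""
    confs = [d["conf"] for d in dimensions]
    counts = [d["evidence_count"] for d in dimensions]
    scores = [d["score"] for d in dimensions]
    total_evidence = sum(counts)

    value = sum(confs) / len(confs)  # base: average of dimension confidences

    # Penalties
    if any(c < 3 for c in counts):
        value -= 0.10
    if max(scores) - min(scores) > 35:
        value -= 0.15
    if any(c < 0.60 for c in confs):
        value -= 0.05
    if total_evidence < 10:
        value -= 0.08

    # Bonuses
    if all(c >= 8 for c in counts):
        value += 0.05
    if all(c >= 0.75 for c in confs):
        value += 0.03

    # Assumption: the 0.90 cap lifts to 0.97 once total evidence reaches 12.
    ceiling = 0.97 if total_evidence >= 12 else 0.90
    return max(0.40, min(ceiling, value))


strong = [{"score": s, "conf": 0.8, "evidence_count": 8} for s in (80, 75, 70, 85)]
weak = [{"score": s, "conf": 0.5, "evidence_count": 1} for s in (90, 50, 60, 70)]
print(confidence(strong), confidence(weak))
```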

Dynamic Indicators

Rank sets these badges on each tool after every evaluation. They appear on tool pages.

rising_momentum: Score ≥ 5 points higher than previous entry

reliability_declining: Score ≥ 5 points lower than previous entry

hype_risk: Trust score is 20+ points below operational score

enterprise_breakout: Trust ≥ 85 AND Infrastructure ≥ 85

new_entry: No prior row in stackscore_history for this tool

verified: Confidence ≥ 0.85 AND total evidence_count ≥ 10

evidence_gap: Total evidence_count < 6 across all dimensions
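
The badge conditions above translate directly into boolean checks. In this sketch the function signature is hypothetical, and a missing previous score (None) stands in for "no prior row in stackscore_history".

```python
def dynamic_indicators(score, prev_score, operational, trust, infrastructure,
                       confidence, total_evidence):
    """Return the list of badges that apply, using the thresholds listed above."""
    badges = []
    if prev_score is None:
        badges.append("new_entry")
    else:
        if score - prev_score >= 5:
            badges.append("rising_momentum")
        if prev_score - score >= 5:
            badges.append("reliability_declining")
    if operational - trust >= 20:
        badges.append("hype_risk")
    if trust >= 85 and infrastructure >= 85:
        badges.append("enterprise_breakout")
    if confidence >= 0.85 and total_evidence >= 10:
        badges.append("verified")
    if total_evidence < 6:
        badges.append("evidence_gap")
    return badges


print(dynamic_indicators(score=82, prev_score=75, operational=90, trust=65,
                         infrastructure=70, confidence=0.9, total_evidence=12))
```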

DIM 1: Operational Intelligence

40% · QUILL

Core question: Can this tool improve real workflows?

1.1 Core Task Utility (30% of dimension)

85–100: Core capability confirmed in 8+ independent reviews matching primary use case. No major caveats.

65–84: Core capability present and working. Consistent but non-blocking caveats in reviews.

40–64: Mixed signal. Capability exists but frequently falls short of product page claims.

0–39: Core capability absent, broken, or failing in majority of reviews.

Sources: Product homepage, G2 reviews (min 10)

1.2 Workflow Integration Depth (25% of dimension)

85–100: 10+ native integrations AND listed in Zapier or Make AND public API documented

65–84: 5–9 native integrations OR API-accessible with clear documentation

40–64: 2–4 integrations OR API in beta or minimally documented

0–39: Standalone only. No integrations. No API.

Sources: Integrations page, Zapier/Make directory, G2 integrations tab

1.3 Output Reliability (25% of dimension)

85–100: Zero or near-zero reliability complaints across 10+ independent sources.

65–84: Occasional issues documented but not a dominant theme.

40–64: Reliability issues in 20–40% of sources discussing output quality.

0–39: Hallucination or serious inaccuracy failures are the primary complaint.

Sources: G2, Reddit r/artificial, Hacker News, X (last 90 days)

1.4 ROI Accessibility (12% of dimension)

85–100: Meaningful free tier OR G2 value score ≥ 4.5/5

65–84: Paid only, G2 value 4.0–4.4, value defended in majority of reviews

40–64: Expensive relative to alternatives per reviews, G2 value 3.5–3.9

0–39: No pricing transparency OR G2 value < 3.5 OR price cited as dealbreaker

1.5 Learning Curve (8% of dimension)

85–100: G2 ease ≥ 4.5 AND "up in minutes" confirmed in multiple reviews.

65–84: G2 ease 4.0–4.4. Some setup complexity but manageable.

40–64: Significant learning curve per reviews.

0–39: G2 ease < 3.5 OR no onboarding docs OR complexity is a primary barrier.

AUTOMATIC PENALTIES

G2 overall rating < 3.0: −15 pts

Fewer than 5 G2 reviews exist: −10 pts

No documented feature list on product page: −8 pts

No demo, tour, or product video findable: −5 pts

Evidence minimum: ≥5 sources including ≥3 independent user reviews. If not met → confidence capped at 0.60.
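
The page gives sub-criterion weights and point penalties but does not spell out how they combine. The sketch below assumes a weighted average of the five sub-criterion scores, minus any triggered penalties, clamped to 0–100; the key and flag names are hypothetical.

```python
# Sub-criterion weights for Dimension 1, as integer percentages.
OPERATIONAL_WEIGHTS_PCT = {
    "core_task_utility": 30,
    "workflow_integration_depth": 25,
    "output_reliability": 25,
    "roi_accessibility": 12,
    "learning_curve": 8,
}

# Automatic penalties in points, keyed by hypothetical flag names.
OPERATIONAL_PENALTIES = {
    "g2_rating_below_3": 15,
    "fewer_than_5_g2_reviews": 10,
    "no_feature_list": 8,
    "no_demo_or_video": 5,
}


def operational_score(sub_scores: dict, flags: set) -> float:
    # Assumption: weighted average first, then penalties, then clamp.
    weighted = sum(sub_scores[k] * pct for k, pct in OPERATIONAL_WEIGHTS_PCT.items()) / 100
    penalty = sum(OPERATIONAL_PENALTIES[f] for f in flags)
    return max(0.0, min(100.0, weighted - penalty))


def cap_confidence(conf: float, source_count: int, independent_reviews: int) -> float:
    # Evidence minimum: >= 5 sources including >= 3 independent user reviews.
    if source_count < 5 or independent_reviews < 3:
        return min(conf, 0.60)
    return conf


subs = {
    "core_task_utility": 90,
    "workflow_integration_depth": 70,
    "output_reliability": 80,
    "roi_accessibility": 60,
    "learning_curve": 50,
}
print(operational_score(subs, {"fewer_than_5_g2_reviews"}))
print(cap_confidence(0.8, source_count=4, independent_reviews=3))
```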

DIM 2: Trust Intelligence

25% · FORGE

Core question: Can this tool be trusted operationally?

Owners: Forge (2.1, 2.2, 2.4, 2.5) · Quill (2.3)

2.1 Data Privacy Posture (30% of dimension)

85–100: Explicit opt-out from AI training. GDPR + DPA available. No data selling.

65–84: Privacy policy present and readable. GDPR mentioned. Ambiguity on training data.

40–64: Policy exists but vague on training data. No opt-out mentioned.

0–39: No privacy policy OR policy explicitly allows training with no opt-out.

Hard rule: If privacy policy cannot be fetched → Trust score cannot exceed 50.

2.2 Security Certification (25% of dimension)

85–100: SOC 2 Type II confirmed AND one additional cert (ISO 27001, HIPAA, FedRAMP)

65–84: SOC 2 Type II confirmed OR SOC 2 Type I with detailed security page

40–64: Security page with claims but no third-party certification confirmed

0–39: No security documentation found

2.3 Output Accuracy / Hallucination Rate (25% of dimension)

85–100: No hallucination or accuracy complaints across 10+ reviews. Accuracy specifically praised.

65–84: Occasional accuracy issues, not a dominant theme.

40–64: Accuracy issues in 20–40% of output-quality reviews.

0–39: Hallucination failures are the primary complaint OR benchmarks contradict company claims.

2.4 Company / Operational Stability (12% of dimension)

85–100: Series B+ from recognizable VCs in last 18 months OR profitable public company

65–84: Series A OR tier-1 seed in last 24 months with active hiring

40–64: Bootstrapped with revenue signals OR small raise >24 months ago

0–39: Recent layoffs OR pivot signals OR funding runway appears exhausted

2.5 Incident Transparency (8% of dimension)

85–100: Public status page with 12+ months of history. Transparent postmortems.

65–84: Status page exists with limited history. No known major incidents.

40–64: No status page but no known incidents in search.

0–39: Confirmed security breach in last 24 months with poor public response.

AUTOMATIC PENALTIES

Confirmed data breach in last 24 months: −20 pts

User data used for training, no opt-out: −15 pts

No privacy policy found: −10 pts

No security page of any kind: −8 pts
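
The Trust dimension combines point penalties with one hard rule (a cap of 50 when the privacy policy cannot be fetched). A minimal sketch follows, assuming penalties are subtracted before the cap is applied; the source does not state the order, and the flag names are hypothetical.

```python
# Automatic Trust penalties in points, keyed by hypothetical flag names.
TRUST_PENALTIES = {
    "data_breach_last_24_months": 20,
    "training_on_user_data_no_opt_out": 15,
    "no_privacy_policy": 10,
    "no_security_page": 8,
}


def finalize_trust_score(weighted_score: float, flags: set,
                         privacy_policy_fetched: bool) -> float:
    score = weighted_score - sum(TRUST_PENALTIES[f] for f in flags)
    if not privacy_policy_fetched:
        # Hard rule: Trust score cannot exceed 50.
        score = min(score, 50)
    return max(0, min(100, score))


print(finalize_trust_score(88, set(), privacy_policy_fetched=False))
print(finalize_trust_score(88, {"no_security_page"}, privacy_policy_fetched=True))
```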

DIM 3: Market Intelligence

20% · SCOUT

Core question: Does this tool matter in the AI ecosystem right now?

Owners: Scout (3.1, 3.2, 3.3) · Flash (3.4)

3.1 Adoption Velocity (35% of dimension)

85–100: G2 reviews growing >20% QoQ OR 500+ total reviews with active recent posting

65–84: Steady growth 5–20% QoQ OR 100–499 G2 reviews with recent activity

40–64: Flat or slow growth. 20–99 G2 reviews. Limited community presence.

0–39: Declining review activity OR fewer than 20 G2 reviews total

3.2 Funding and Investment Signal (30% of dimension)

85–100: $10M+ from recognizable VCs (a16z, Sequoia, YC) in last 18 months OR profitable public company

65–84: $1M–$10M raised OR tier-1 seed in last 24 months

40–64: Bootstrapped with revenue signals OR small raise >24 months ago

0–39: No funding data, no revenue signals, or last raise >36 months ago

3.3 Ecosystem Integration Signals (20% of dimension)

85–100: Listed in 2+ major platform marketplaces AND named recognizable enterprise customers

65–84: 1 major marketplace listing OR 3+ recognizable partner logos

40–64: Partners page exists, companies unrecognizable or small

0–39: No marketplace, no partners, no enterprise customer signals

3.4 Narrative Quality / Signal vs Hype (15% of dimension)

85–100: Tier-1 tech press with analytical substance. Active technical blog. Social is informative.

65–84: Consistent press coverage, some analytical pieces. Blog active.

40–64: Primarily promotional. Press = company-issued only.

0–39: Hype-dominant with unverifiable superlatives. No independent coverage.

FLASH HYPE TRIGGERS

Unverifiable superlatives · Benchmark claims without methodology links · Multiple 5-star G2 reviews posted same day · Press that reads as paid placement
→ hype_score > 70 = automatic −10 pts on market score

AUTOMATIC PENALTIES

Last funding round >36 months ago, no revenue signals: −15 pts

Flash hype_score > 70: −10 pts

G2 review count declined QoQ: −8 pts

Company blog not updated in >6 months: −5 pts

DIM 4: Infrastructure Intelligence

15% · FORGE

Core question: Can this tool become part of a durable AI operating stack?

4.1 API Maturity (30% of dimension)

85–100: Versioned API (v1+), complete auth docs, rate limits with numbers, OpenAPI spec downloadable

65–84: API versioned and documented with good auth docs, some gaps in reference

40–64: API exists but unversioned, rate limits absent, or auth docs incomplete

0–39: No public API OR API exists with no documentation

4.2 Development Activity (25% of dimension)

85–100: GitHub commits in last 30 days AND changelog updated in last 60 days

65–84: Commits in last 90 days. Changelog updated in last 90 days.

40–64: Last activity 91–180 days ago.

0–39: 0 commits or changelog updates in 180+ days OR no public repository
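
Because criterion 4.2 is defined entirely by date thresholds, its band selection is easy to illustrate. The sketch below is one reading of the rubric: the function is hypothetical, "last activity" is assumed to mean the more recent of the two signals, and exactly 180 days is assumed to fall in the 40–64 band.

```python
def development_activity_band(commit_age_days: int, changelog_age_days: int,
                              has_public_repo: bool = True) -> tuple:
    """Return the (low, high) score band for criterion 4.2.

    Ages are days since the most recent commit and changelog entry.
    """
    if not has_public_repo:
        return (0, 39)
    if commit_age_days <= 30 and changelog_age_days <= 60:
        return (85, 100)
    if commit_age_days <= 90 and changelog_age_days <= 90:
        return (65, 84)
    # Assumption: "last activity" is the more recent of the two signals.
    if min(commit_age_days, changelog_age_days) <= 180:
        return (40, 64)
    return (0, 39)


print(development_activity_band(10, 45))
print(development_activity_band(120, 150))
```

The function returns only the band; choosing a score within the band is left to the evaluating agent and is not modeled here.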

4.3 SDK and Developer Experience (20% of dimension)

85–100: Official SDKs for Python AND JavaScript. Code examples on every major docs page. Quickstart < 10 minutes.

65–84: SDK for 1 language + REST API with clear code examples

40–64: REST API only. Minimal code examples. No official SDK.

0–39: No SDK. No code examples. Docs describe endpoints in prose only.

4.4 Orchestration Readiness (15% of dimension)

85–100: Webhooks fully documented WITH streaming/async API AND ≥1 AI framework integration (LangChain, LlamaIndex, or MCP)

65–84: Webhooks documented OR streaming API supported.

40–64: Polling-only. No webhooks. No streaming.

0–39: No async support. No orchestration path viable.

4.5 Platform Durability (10% of dimension)

85–100: 99.9%+ SLA documented. Status page with 12+ months clean history. Deprecation policy stated.

65–84: SLA mentioned OR status page with 6+ months clean history.

40–64: No SLA but no known outage history.

0–39: Known significant outages in last 12 months OR breaking API changes with no advance notice.

AUTOMATIC PENALTIES

No public API exists at all: −20 pts

GitHub shows 0 commits in last 180 days: −15 pts

No changelog ever published: −10 pts

Rate limits completely undocumented: −8 pts

API docs not updated in 12+ months: −8 pts

Agent Execution Sequence

9 agents run in order for every tool evaluation — Insta directs, 8 specialists execute. Each step writes to agent_runs.

1. scout: Discovers tool, fetches metadata, seeds evidence_sources

2. forge: Fetches API docs + GitHub, scores infrastructure + trust

3. quill: Fetches reviews + pricing, scores operational + trust accuracy

4. flash: Scores market narrative quality, runs hype detection

5. scout: Scores market adoption + funding + ecosystem signals

6. rank: Reads all dimension scores, computes composite, sets confidence, writes history

7. pulse: Validates run integrity, checks for anomalies, writes governance report

8. insta: Synthesises bureau notes for tool page display (max 3, 1 sentence each)
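
The eight listed steps can be pictured as a fixed, ordered pipeline in which every step appends a record. This is only a schematic: the handler interface, the record fields, and the in-memory list standing in for agent_runs are all assumptions.

```python
# (agent, task) pairs in the order listed above.
SEQUENCE = [
    ("scout", "discover tool, fetch metadata, seed evidence_sources"),
    ("forge", "fetch API docs + GitHub, score infrastructure + trust"),
    ("quill", "fetch reviews + pricing, score operational + trust accuracy"),
    ("flash", "score market narrative quality, run hype detection"),
    ("scout", "score market adoption + funding + ecosystem signals"),
    ("rank", "compute composite, set confidence, write history"),
    ("pulse", "validate run integrity, write governance report"),
    ("insta", "synthesise bureau notes (max 3, 1 sentence each)"),
]


def run_evaluation(tool_id: str, handlers: dict) -> list:
    """Run each step in order and record it; the list stands in for agent_runs."""
    agent_runs = []
    for step, (agent, task) in enumerate(SEQUENCE, start=1):
        result = handlers[agent](tool_id, task)
        agent_runs.append({"step": step, "agent": agent,
                           "tool_id": tool_id, "result": result})
    return agent_runs


# Stub handlers so the sketch runs end to end.
handlers = {agent: (lambda tool_id, task: "ok") for agent, _ in SEQUENCE}
runs = run_evaluation("example-tool", handlers)
print([r["agent"] for r in runs])
```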


Part 2 of 2

StackScore News™

Every AI news article is classified by Flash before it appears on the site. Flash must fetch the full article, find corroborating sources, and assign a credibility label. Flash cannot score from a headline alone.

What Flash Must Do Before Scoring

01. Fetch the full article text at the source URL, not the title or snippet alone

02. Find 1–2 corroborating sources via web search to confirm the claim

03. Check prior coverage: was this claim previously reported as speculation or as confirmed?

The 5 Narrative Labels

Flash assigns exactly one label to every article. The label appears as a badge on the news feed and on each article page.

VERIFIED

credibility_score ≥ 85 AND secondary source confirmed

Multiple independent primary sources corroborate the claim. High author and publication credibility.

LIKELY

credibility_score 65–84, limited secondary confirmation

Credible source with strong track record. Secondary confirmation exists but is limited.

SPECULATIVE

credibility_score 40–64, primary source unconfirmed

Claim is plausible but unverified. Primary source lacks independent corroboration.

PROMOTIONAL

Source is company-originated, no independent verification

Content originated from the company itself. No independent third-party verification found.

HYPE ALERT

credibility_score < 40 OR hype patterns detected

Signal-to-hype ratio is critically low. Unverifiable claims, coordinated amplification, or fabricated benchmarks detected. Do not amplify.

7 Dimensions Scored Per Article (each 0–100)

Flash scores every article across 7 dimensions. Credibility is the primary driver of the narrative label; Signal-to-Hype Ratio captures the ratio of operational evidence to promotional language. All 7 scores are stored and displayed on article pages.

Credibility (credibility_score)

Primary source quality, author track record, publication tier. This is the main driver of the narrative label.

Signal-to-Hype Ratio (signal_to_hype_ratio)

Ratio of operational evidence to promotional language. High score = evidence-dominant, reproducible, sourced. Low score = unverifiable superlatives or coordinated amplification.

Enterprise Relevance (enterprise_relevance)

How meaningful is this to enterprise or professional AI adoption? Not just developer curiosity.

Infrastructure Impact (infrastructure_impact)

Does this change how AI systems are built or integrated? High score = architects need to know this.

Workflow Disruption (workflow_disruption)

Does this materially change how workflows are executed today? Practical impact, not theoretical.

Narrative Longevity (narrative_longevity)

Will this matter in 3 months? Or just 3 days? Low score = ephemeral news cycle story.

Ecosystem Velocity (ecosystem_velocity)

Does this accelerate or decelerate a category trend? High score = it changes the trajectory of a space.

Credibility Score → Label Mapping

85–100 → VERIFIED (AND secondary source confirmed)

65–84 → LIKELY (limited secondary confirmation)

40–64 → SPECULATIVE (primary source unconfirmed)

< 40 → HYPE ALERT (OR hype patterns detected)

any → PROMOTIONAL (company-originated content, score overridden)
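
The mapping above can be sketched as a small decision function. The precedence between rules is an assumption, since the page does not state it: company-originated content is checked first (the score is overridden), then hype patterns, and a score of 85+ without secondary confirmation is assumed to fall back to LIKELY. The parameter names are hypothetical.

```python
def narrative_label(credibility_score: int, secondary_confirmed: bool,
                    company_originated: bool = False,
                    hype_patterns: bool = False) -> str:
    if company_originated:
        return "PROMOTIONAL"      # score overridden
    if hype_patterns or credibility_score < 40:
        return "HYPE ALERT"
    if credibility_score >= 85 and secondary_confirmed:
        return "VERIFIED"
    if credibility_score >= 65:
        # Assumption: 85+ without secondary confirmation lands here too.
        return "LIKELY"
    return "SPECULATIVE"


print(narrative_label(90, secondary_confirmed=True))
print(narrative_label(50, secondary_confirmed=False))
```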
