Monitors help you continuously evaluate and improve Fin’s conversation quality at scale. They give you a structured way to define which conversations should be reviewed—whether that’s a random sample for baseline quality, or a targeted set based on higher-risk or higher-impact signals. This replaces ad-hoc sampling and spreadsheet-driven QA with a repeatable system that scales as volume grows.
Monitors work with Custom Scorecards:
Monitors define what gets reviewed
Scorecards define how each conversation is evaluated
Scorecards can include criteria that are:
Reviewed manually
Evaluated using AI
Or a combination of both
This ensures quality is assessed consistently, while still allowing flexibility in how reviews are performed.
Note: Monitors is available as part of the Pro add-on. This feature is currently in closed beta. If you'd like access, you can request it via this form. We’re gradually expanding availability and will follow up if a spot opens up.
How teams use Monitors
Teams use Monitors to maintain ongoing visibility into quality and focus attention where it matters most. Common use cases include:
Reviewing a random sample to understand overall quality trends
Focusing on higher-risk or higher-impact conversations, such as:
Low CX scores
Policy breaches
Legal threats
Other business-specific indicators
Tracking conversations tied to a specific initiative, like a feature launch, pricing change, or product update
Monitors make it easier to detect patterns, surface issues earlier, and generate insights that can be shared with product, support, or leadership teams.
How to create a Monitor
To access Monitors, go to Analyze > Monitors.
To create a new Monitor, click + Monitor. Pick one of the templates or Start from scratch.
Choose conversations
Give your Monitor a name and choose which conversations the Monitor should review.
This can be:
A random sample (for example, a weekly sample of Fin conversations for baseline QA)
A targeted set based on specific signals or risk (for example, all conversations where a customer shows signs of financial vulnerability)
You can narrow down conversations of interest by:
Applying precise filters, e.g. Resolution State, Topic, CX Score, and more.
Using the flag criteria input, which uses natural language to describe the types of conversations you want flagged.
Note: A single conversation can appear in multiple monitors. Each monitor runs independently, so if a conversation matches more than one monitor's criteria, it will be flagged in each. Clicking through to a conversation shows exactly why it was flagged by that monitor.
Monitoring mode
When creating a monitor, you'll first choose how it runs:
Continuous: Runs on an ongoing basis — matches new conversations as they close and adds them to the monitor automatically.
One-time: Backfill only. Matches conversations from historical data — new conversations that close after setup won't be included.
Scheduled: Runs on a recurring daily or weekly cadence, letting teammates review conversations on a regular schedule.
Select the start date
Choose when the monitor should begin evaluating conversations. This allows you to run QA on historical conversations from a specific point in time, as well as continuously surface new matching conversations.
Choose when conversations are added to the monitor
You can control when a conversation is matched to a monitor. This determines when the monitor evaluates the conversation — and, if a scorecard is attached, when that scorecard runs. Select one of the following options:
Fin is done – Conversations are added once Fin has fully completed handling them (resolved, escalated, or followed up with no customer reply).
Conversation is closed – Conversations are added only after the conversation is closed, either by a teammate or by Fin.
Use this setting to align evaluation timing with your workflow — whether you want to assess Fin immediately after it finishes, or only once the conversation is officially closed.
Choose the reviewer
Select the teammate who will review matched conversations. All conversations that match the Monitor are automatically assigned to that reviewer, so reviews are routed consistently without manual coordination.
In this example, we're creating a Monitor that flags conversations with "Vulnerable customers", starts finding matches from Now (today), and assigns them to Alissa for review.
Attach a scorecard (optional)
You can associate a scorecard with a monitor to automatically evaluate every matched conversation against defined criteria. Once selected, the scorecard runs as soon as the conversation is added to the monitor, and results appear in the monitor for reporting and review.
Test your monitor before turning it on
For monitors that use natural language flag criteria, use the Test monitor tool to validate your criteria against real conversations before you create or update the monitor. It shows which conversations would be flagged and highlights mismatches so you can refine the wording and reduce false positives or misses.
Tip: We strongly recommend testing every monitor with flag criteria before turning it on.
In the Flag criteria section, click Run test, or click the Test button in the top right.
Review sample conversations
For existing monitors, this list is automatically populated with recent conversations that were flagged and not flagged by the monitor. You can also paste additional conversation URLs or IDs to test specific edge cases.
Check the results
For each conversation, review the Monitor result (Flagged / Not flagged) and mark whether it’s Correct. The evaluation summary shows your overall pass rate and highlights mismatches.
Refine and retest
Update the Flag criteria description and rerun the test until the results accurately reflect what you want the monitor to capture.
Use the Refine wording button to let the AI automatically rephrase your flag criteria. This can help tighten the language and improve accuracy without having to rewrite criteria manually.
Once the Monitor has been created, it will start finding matches and appear on your Monitors page. You can always edit your Monitor configuration later, if needed.
Creating and configuring Scorecards
A custom scorecard defines what “good” looks like for your team by explicitly setting the criteria you care about—such as accuracy, tone, or policy adherence.
You can have multiple scorecards for different Monitors. Simply choose which scorecard you want to associate with a Monitor from the Monitor setup screen.
To create a scorecard
Go to Analyze > Monitors and click Scorecards in the top right. You can use the out-of-the-box Fin Quality Scorecard or create your own by clicking +New scorecard.
Create a new scorecard attribute
Start by adding scorecard attributes. Click +New attribute.
When creating a new attribute, you’ll:
1. Name the attribute. Give the attribute a short, clear name (for example, Sentiment or Answer accuracy). This name appears in reports and will be used as a reference.
2. Describe what’s being evaluated. Add a clear description explaining:
What the attribute checks
How it should be scored
This ensures consistent evaluation by both AI and human reviewers.
3. Choose how the attribute is scored. Decide whether the attribute should be:
Automatically scored with AI, or
Manually scored by human reviewers
You can mix AI-scored and human-scored attributes within the same scorecard.
Tip: Scorecard attributes are reusable. Once you've created an attribute, you can add it to multiple scorecards — no need to recreate it from scratch each time.
4. Define rating options. Add the possible rating values a reviewer or AI can select (for example: Good, Okay, Poor). Each attribute must have at least two rating options. For each rating option, you’ll:
Name the rating (short and clear)
Describe when it should be selected
Assign a score (for example, 100%, 50%, 0%) or mark it as Not scored
The score you assign determines how that rating contributes to the overall review score.
5. Choose whether to include it in the review score
You can toggle Include in review score on or off.
When enabled, this attribute contributes to the overall review score.
When disabled, the attribute is recorded for analysis and reporting, but does not affect the overall score.
Configure your scorecard
After adding scorecard attributes, you can configure how they affect the overall review result.
Marking a scorecard attribute as critical
You can mark an attribute as Critical. If a critical attribute receives a failing rating, the entire review fails:
The overall review score becomes 0%
This overrides all weights
“Not scored” ratings exclude the attribute from the overall score and do not trigger failure
Critical attributes are useful for non-negotiable standards such as:
Compliance requirements
Safety or policy adherence
Escalation handling
Scorecard attribute weighting
Each attribute can be assigned a weight to define its relative importance.
Weight must be an integer between 0 and 100
Higher weights increase the impact of that attribute on the overall review score
Weights only apply to attributes that are included in the review score.
Use weights to reflect what matters most. For example, you might assign a higher weight to Accuracy than to Efficiency if correctness is more important than speed.
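To illustrate, assuming weights behave as a simple weighted average: if Accuracy (weight 70) is rated 100% and Efficiency (weight 30) is rated 0%, the overall review score would be (70 × 100% + 30 × 0%) ÷ (70 + 30) = 70%.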
Adding a pass threshold
You can define a pass threshold for the scorecard. The pass threshold determines the minimum overall score required for a review to be considered passing. For example:
If the pass threshold is 80%, any review scoring below 80% will be marked as failed.
This is evaluated after weighted scoring, provided no critical attribute has already failed the review.
How the overall review score works
Each attribute is rated using its defined rating options.
Ratings contribute their assigned score (or are excluded if marked Not scored).
Included attributes are combined using their assigned weights.
If any critical attribute receives a failing rating, the overall review score becomes 0%.
The final score is compared against the pass threshold to determine whether the review passes or fails.
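If it helps to see the logic end to end, here is a minimal sketch in Python of how these steps can combine. It assumes included attributes combine as a weight-normalized average; the data shape, function name, and example values are illustrative only and don't reflect Intercom's internal implementation or any API.

```python
# Minimal sketch of the review-scoring flow described above.
# Assumption: included attributes combine as a weight-normalized average.
# The data shape and names are illustrative, not Intercom's implementation.

def overall_review_score(attributes, pass_threshold=0.8):
    """Return (score, passed) for a list of rated scorecard attributes.

    Each attribute dict holds:
      rating_score: float in [0, 1], or None if the rating is "Not scored"
      weight:       integer between 0 and 100
      included:     whether the attribute counts toward the review score
      critical:     whether a failing rating fails the whole review
      failing:      whether the selected rating counts as a fail
    """
    # A failing rating on any critical attribute fails the review outright;
    # "Not scored" ratings never trigger this.
    for attr in attributes:
        if attr["critical"] and attr["failing"] and attr["rating_score"] is not None:
            return 0.0, False

    # Combine included, scored attributes using their weights.
    scored = [a for a in attributes if a["included"] and a["rating_score"] is not None]
    total_weight = sum(a["weight"] for a in scored)
    if total_weight == 0:
        return None, None  # nothing contributes to the overall score

    score = sum(a["weight"] * a["rating_score"] for a in scored) / total_weight
    return score, score >= pass_threshold


# Example: Accuracy (critical, weight 70) rated Good (100%),
# Tone (weight 30) rated Okay (50%) -> (0.85, True) against an 80% threshold.
example = [
    {"rating_score": 1.0, "weight": 70, "included": True, "critical": True, "failing": False},
    {"rating_score": 0.5, "weight": 30, "included": True, "critical": False, "failing": False},
]
print(overall_review_score(example, pass_threshold=0.8))
```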
Where to view scores
Once reviews are completed, scores are visible in both the conversation list and within each conversation.
Within a Monitor, the conversation list shows the overall review score (percentage or Fail) alongside the individual attribute ratings as columns. This makes it easy to quickly scan performance across conversations and spot failures or low scores.
When you open a conversation and go to the Score tab, you can see the assigned scorecard, review status, overall score, and the selected rating for each attribute. This view shows exactly how the final score was determined.
Managing reviews
Each monitor gives you a clear view of the conversations it has matched and the scorecard scores. This makes it easy to move from detection to review to action, without leaving Intercom. Click on a Monitor and you can see:
All matched conversations in one place.
Review status (Unreviewed, Reviewed, Reviewed + fix needed, Reviewed + fixed, Reviewed + won't fix).
Any AI-applied scores, and where manual scoring is still needed.
Who is assigned as the reviewer.
Manual reviews can be completed directly from the Monitor conversation view by clicking on the conversation and filling in the scorecard. AI‑generated scores can also be overridden by human reviewers if needed.
QA review status labels
Review status labels use a consistent "Reviewed" prefix to clearly distinguish the review outcome from the action needed.
Label | What it means
Unreviewed | No review has taken place yet
Reviewed | Review complete, no action needed
Reviewed + fix needed | Review complete, a fix is required
Reviewed + won't fix | Review complete, issue acknowledged but won't be actioned
Reviewed + fixed | Review complete, fix has been applied
Best practices for writing Scorecard Attribute descriptions
Start with the core principle: Attributes compete. The AI looks at the full list and selects the single best match for each attribute. Your job is to make that choice obvious.
1. Use clear, concise names
Keep names short and specific. Someone reading the list should immediately understand the purpose without opening the description.
Bad: Customer Communication Issues
Better: Tone – Rude or Dismissive
2. Write comprehensive descriptions
Descriptions carry most of the classification signal.
Explicitly describe all conversation types that belong.
Include keywords, common phrasings, and examples.
Think through edge cases and include them.
Clarify what “good” and “bad” instances look like.
The description should make it easy for the AI to recognize real-world phrasing, not just abstract definitions.
3. Make attributes clearly distinct
Attributes within the same scorecard should not compete conceptually.
Avoid semantic overlap.
Ensure each attribute has a clear boundary.
If two attributes could reasonably apply for the same reason, refine one of them.
It’s fine if a single conversation fits multiple attributes across the scorecard. What matters is that within each attribute set, the values are clearly separable.
4. Evaluate quality systematically
When reviewing your scorecard attributes, assess each one on:
Clarity / Conciseness
Description Comprehensiveness
Attribute Distinction
Overlapping Attributes (if any)
Final Score + Commentary
This structured review forces you to tighten definitions and reduce ambiguity — which directly improves classification performance.
Best practices for writing Monitor (Flag) criteria
Monitors do not compete. Each monitor runs independently as a yes/no check. Multiple monitors can flag the same conversation — and that’s fine.
Because of this, precision matters more than distinction.
1. Describe observable behavior, not inferred intent
Focus on what appears in the conversation.
Avoid: “Customer is frustrated”
Prefer: “Customer uses phrases such as ‘This is unacceptable,’ ‘I’m extremely disappointed,’ or ‘This is ridiculous.’”
The AI performs better when evaluating explicit signals rather than emotional interpretations.
2. Include concrete examples
Examples dramatically reduce ambiguity.
Use explicit phrasing patterns: “e.g., ‘cancel my subscription,’ ‘close my account,’ ‘delete my data’”
Examples anchor the model to real-world language.
3. Add explicit exclusions
Reducing false positives is critical for monitors.
Example: “Customer uses profanity. EXCLUDE: mild language such as ‘damn’ or ‘crap.’”
If something should not trigger the monitor, say so clearly.
4. Use quantifiable thresholds
Avoid vague wording.
Bad: “Fin gives a short response.”
Better: “Fin response is fewer than 50 words.”
Specific thresholds improve consistency.
5. Break multi-step logic into numbered criteria
If your monitor depends on sequence or pattern, structure it clearly:
Customer expresses frustration.
Fin responds without acknowledging emotion.
Customer repeats complaint.
This makes the logic deterministic and easier to evaluate.
6. Keep simple monitors simple
If the rule is straightforward, don’t overcomplicate it.
Example: “Fin suggests next steps (e.g., ‘Please try clearing your cache,’ ‘Log out and back in,’ ‘Click this link’).”
Clarity beats complexity.
Coming soon
We’re expanding Monitors with more powerful ways to detect issues, measure quality, and take action. Upcoming improvements include:
Structured sampling: Define a fixed sample size per Monitor, including recurring weekly or monthly samples, to support consistent QA scoring.
Advanced reporting: Filter reports by Monitor, scorecard, attributes, and scores directly within the reporting platform.
Balanced reviewer workload: Assign Monitors to multiple reviewers or teams to distribute manual review work evenly.
Scorecard rating transparency: See why an AI rating was applied to a scorecard attribute, or require reviewers to provide a reason for manual scores.
Out-of-the-box procedure monitors: Automatically track whether procedures are triggered and completed successfully, flagging connector failures, execution errors, user frustration signals, and escalation handling quality.
Real-time alerts: Get notified when conversations in a Monitor cross defined thresholds or fail a scorecard.
Pre-deployment scoring: Test changes in preview by evaluating conversations against scorecards before going live.
FAQs
If a conversation is added to a monitor and evaluated, what happens if it reopens later? Will it be evaluated again?
No, a conversation is evaluated only once per monitor. Conversations are added to a monitor based on the setting you’ve selected in the monitor configuration (for example, “Fin is done” or “Conversation is closed”). When the conversation reaches that state, it’s matched into the monitor and evaluated. If the conversation later reopens because the customer sends a new message, it won’t be re-matched or re-evaluated under the same monitor version. The original evaluation is the only one recorded.
Do monitors work for Fin Voice?
No, at the moment Fin Voice is not supported.