Extraction Pipeline User Guide

How to extract and normalise key terms from short verbatim responses using TruVerbatim’s Entity Extraction pipeline.

Overview

The Extraction Pipeline is designed for responses that mention specific things – brands, products, features, places, or other named entities. Instead of clustering responses by meaning, the pipeline reads each response and pulls out every entity mentioned, corrects spelling variations, and builds a clean frequency-ranked codeframe.

Each response is assigned:

One or more extracted entities (e.g. “Nike”, “Adidas”)
A raw mention preserving the original text before normalisation

Step-by-Step Guide

Step 1: Upload Your Data

Open TruVerbatim and sign in
Drag and drop your CSV or Excel file onto the upload area in the chat
Select the column containing your verbatim responses

Optional: Enable auto-cleaning to remove personal information, profanity, duplicates, and blank rows.

Step 2: Review the Recommendation

TruVerbatim analyses your data and presents a triage recommendation. If your responses are short or sparse, the system will recommend Key Term Extraction as the best approach.

You will see:

A recommendation card with a confidence score
A brief explanation of why extraction was recommended
Key data metrics (median response length, short response rate, vocabulary diversity)
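
To build intuition for these metrics, here is a minimal Python sketch of how similar figures could be computed on your own data before uploading. The guide does not say how TruVerbatim defines them, so the definitions here are assumptions: length is counted in words, a "short" response is one of three words or fewer, and vocabulary diversity is unique words divided by total words. The function name triage_metrics and the short_word_limit parameter are hypothetical.

```python
from statistics import median


def triage_metrics(responses, short_word_limit=3):
    """Rough, assumed versions of the triage metrics for a list of strings."""
    # Lower-case and split on whitespace; skip blank responses.
    tokenised = [r.lower().split() for r in responses if r and r.strip()]
    if not tokenised:
        return {"median_length": 0, "short_rate": 0.0, "vocab_diversity": 0.0}

    lengths = [len(tokens) for tokens in tokenised]
    all_words = [word for tokens in tokenised for word in tokens]
    return {
        # Median number of words per response.
        "median_length": median(lengths),
        # Share of responses at or under the assumed "short" threshold.
        "short_rate": sum(n <= short_word_limit for n in lengths) / len(lengths),
        # Unique words divided by total words (type-token ratio).
        "vocab_diversity": len(set(all_words)) / len(all_words),
    }


sample = [
    "Nike",
    "nike and adidas",
    "I mostly buy Adidas because the fit is better",
]
metrics = triage_metrics(sample)
print(metrics)
```

A low median length and a high short response rate are the kind of signal that would plausibly point towards Key Term Extraction, though the actual thresholds the system uses are not documented here.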

If the recommendation shows “Thematic Analysis” but you know your data contains entity mentions, you can override and select “Key Term Extraction” instead.

Step 3: Select Key Term Extraction

Click the Key Term Extraction pipeline card. The analysis begins immediately.

Step 4: Watch the Analysis Run

Real-time progress updates appear in the chat:

Starting entity extraction pipeline – data is loaded and validated
Understanding your data – the system samples your responses to understand the domain (e.g. brands, products, places)
Extracting entities – the AI processes your responses in batches, with a progress percentage updating as it goes
Building codeframe – extracted entities are counted and ranked by frequency
Generating insight – the AI writes a brief summary of the findings

A progress bar shows the percentage complete throughout.

Step 5: View Your Results

When the extraction completes, the chat displays:

Entity frequency chart – a bar chart showing your extracted entities ranked by how often they were mentioned
AI-generated insight – a narrative summary of the top entities and their distribution
Download button – click to download the full classified CSV

Step 6: Download Your Results

Click the Download CSV button. The exported file includes:

Column	Description
verbatim_id	Row identifier
verbatim_text	The original response
EXTRACTED_ENTITY	All normalised entities found (semicolon-separated)
RAW_MENTION	The original text before normalisation
THEME	Primary entity
Original columns	All metadata from your uploaded file
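
If you want to work with the export in your own scripts, the sketch below shows one way to read it in Python and split the semicolon-separated EXTRACTED_ENTITY column into a list. The column names come from the table above. The sample rows are invented for illustration, and the exact content of RAW_MENTION for multi-entity responses is an assumption.

```python
import csv
import io

# Invented sample rows that mimic the documented export columns.
SAMPLE_EXPORT = """verbatim_id,verbatim_text,EXTRACTED_ENTITY,RAW_MENTION,THEME
1,Nike and Adidas,Nike; Adidas,Nike and Adidas,Nike
2,nikee,Nike,nikee,Nike
3,I bought shoes from Nike/Reebok,Nike; Reebok,Nike/Reebok,Nike
"""


def load_export(handle):
    """Read export rows and add an 'entities' list parsed from EXTRACTED_ENTITY."""
    rows = []
    for row in csv.DictReader(handle):
        # Split on semicolons and trim the spaces around each entity.
        row["entities"] = [
            e.strip() for e in row["EXTRACTED_ENTITY"].split(";") if e.strip()
        ]
        rows.append(row)
    return rows


rows = load_export(io.StringIO(SAMPLE_EXPORT))
for row in rows:
    print(row["verbatim_id"], row["entities"], "primary:", row["THEME"])
```

For a real download, replace the io.StringIO object with an opened file, for example open("export.csv", newline="", encoding="utf-8"), where the filename is whatever you saved the export as.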

How It Works

Domain Understanding

Before processing your full dataset, the pipeline samples a selection of your responses to understand the domain. This helps the AI recognise whether responses are about brands, products, cities, features, or something else entirely – so it extracts the right type of entity.

Entity Extraction

For each group of responses, the AI:

Uses the domain context to understand what types of entities to expect
Reads each response carefully
Extracts every named entity mentioned
Splits multi-entity responses (e.g. “Nike and Adidas” becomes two separate entities)
Corrects typos and spelling variations while preserving meaning
Assigns a confidence level (high, medium, or low)

Normalisation

The AI automatically normalises variations:

“Nike”, “nike”, “NIKE”, “Nikee” all become “Nike”
“Customer Service”, “customer service”, “cust. service” are unified under a single label
The original mention is preserved in the RAW_MENTION column so you can always see what the respondent actually typed
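
The pipeline performs normalisation with AI, and its internal method is not described in this guide. Purely to illustrate the idea, the following Python sketch approximates it with a fixed list of canonical names, a small alias table, and fuzzy string matching from the standard library. The names CANONICAL, ALIASES, and normalise, and the 0.8 similarity cutoff, are all assumptions made for this example.

```python
from difflib import get_close_matches

# Hypothetical canonical labels and hand-written aliases for abbreviations.
CANONICAL = ["Nike", "Adidas", "Customer Service"]
ALIASES = {"cust. service": "Customer Service"}


def normalise(raw):
    """Map a raw mention to a canonical label, or return it trimmed if unknown."""
    # Case-fold and collapse repeated whitespace before comparing.
    key = " ".join(raw.lower().split())
    if key in ALIASES:
        return ALIASES[key]
    lookup = {name.lower(): name for name in CANONICAL}
    # Fuzzy match catches casing differences and small typos such as "Nikee".
    match = get_close_matches(key, list(lookup), n=1, cutoff=0.8)
    return lookup[match[0]] if match else raw.strip()


for raw_mention in ["Nike", "nike", "NIKE", "Nikee", "cust. service", "Puma"]:
    # Keep the raw mention next to the normalised value, as RAW_MENTION does.
    print(f"{raw_mention!r} -> {normalise(raw_mention)!r}")
```

Unlike this sketch, the AI does not need a predefined list of entities. The example only shows why keeping the raw mention beside the normalised value makes the result easy to audit.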

Codeframe Building

Once extraction is complete, the system:

Counts how many times each unique entity appears
Ranks entities by frequency (most mentioned first)
Calculates each entity’s share as a percentage of total responses
Builds a codeframe compatible with the rest of TruVerbatim’s tools (Q&A, sentiment, PowerPoint)
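
The counting and ranking steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TruVerbatim's implementation: build_codeframe is a hypothetical name, each entity is counted at most once per response, and percentages use the total number of responses (including those with no entity) as the base.

```python
from collections import Counter


def build_codeframe(extracted):
    """Build a frequency-ranked codeframe from per-response entity lists."""
    total = len(extracted)
    counts = Counter()
    for entities in extracted:
        # Assumption: repeated mentions within one response count once.
        counts.update(set(entities))
    # Most mentioned first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {
            "entity": entity,
            "count": count,
            "percent": round(100 * count / total, 1) if total else 0.0,
        }
        for entity, count in ranked
    ]


# Invented example: four responses, one with no extractable entity.
extracted = [["Nike", "Adidas"], ["Nike"], ["Reebok"], []]
codeframe = build_codeframe(extracted)
for code in codeframe:
    print(code)
```

Because a response can mention several entities, the percentages in a codeframe like this can sum to more than 100.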

After the Extraction

Ask Questions

Type questions in the chat to explore your results:

“What are the top 10 entities?”
“How many people mentioned Nike?”
“Show me a crosstab of entities by age group”
“Show me the verbatims that mentioned Adidas”
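
If you prefer to reproduce a crosstab outside the chat, the downloaded CSV contains everything needed. The Python sketch below counts entity mentions by a demographic column. The age_group column and the sample rows are hypothetical (your file will carry whatever metadata columns you uploaded), and the crosstab function is an illustration rather than what the Q&A tool runs.

```python
from collections import Counter

# Invented rows; "age_group" stands in for any metadata column in your file.
rows = [
    {"age_group": "18-34", "EXTRACTED_ENTITY": "Nike; Adidas"},
    {"age_group": "18-34", "EXTRACTED_ENTITY": "Nike"},
    {"age_group": "35-54", "EXTRACTED_ENTITY": "Adidas"},
    {"age_group": "35-54", "EXTRACTED_ENTITY": ""},
]


def crosstab(rows, by):
    """Count responses mentioning each entity within each value of `by`."""
    table = Counter()
    for row in rows:
        # Split the semicolon-separated field; a set avoids double counting.
        entities = {
            e.strip() for e in row["EXTRACTED_ENTITY"].split(";") if e.strip()
        }
        for entity in entities:
            table[(entity, row[by])] += 1
    return table


table = crosstab(rows, "age_group")
for (entity, group), n in sorted(table.items()):
    print(f"{entity:<8} {group:<6} {n}")
```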

Handling Multi-Entity Responses

The extraction pipeline handles responses that mention multiple entities. For example:

Response	Extracted entities
“Nike and Adidas”	Nike; Adidas
“I bought shoes from Nike/Reebok”	Nike; Reebok

Each entity is counted separately in the frequency chart. The CSV shows all entities in the EXTRACTED_ENTITY column (semicolon-separated) and the primary entity in the THEME column.

Troubleshooting

Issue	Likely cause	Solution
Entities not being split	Unusual separator in responses	The AI handles “/”, “&”, “and”, and commas – other separators may need pre-processing
Too many variations of the same entity	Unusual spellings or abbreviations	The AI normalises most variations, but you can merge in post-processing
“Uncodeable” responses	Blank, gibberish, or truly non-extractable text	These are expected – review them to check they are genuinely non-responses
Unexpected entities extracted	The AI misinterpreted the domain	This can happen with ambiguous data. Try re-running the analysis
Analysis seems slow	Large dataset (5,000+ responses)	Expected – the pipeline processes your data in stages. Progress updates show the current stage
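
Two of the fixes above, pre-processing unusual separators and merging leftover variations, are easy to script. The Python sketch below shows one possible approach. Which separators appear in your data, and which variants need merging, are assumptions you would adjust yourself: the characters in EXTRA_SEPARATORS and the entries in MERGE_MAP are placeholders.

```python
import re

# Assumed extra separators ("|", "+", ";") to rewrite as commas before upload.
EXTRA_SEPARATORS = re.compile(r"\s*[|+;]\s*")

# Hypothetical variants to fold into one label after download.
MERGE_MAP = {"Nike Inc": "Nike", "Nikes": "Nike"}


def normalise_separators(text):
    """Replace unusual separators with commas, which the pipeline handles."""
    return EXTRA_SEPARATORS.sub(", ", text).strip(" ,")


def merge_variants(entity_field, merge_map=MERGE_MAP):
    """Merge variants inside a semicolon-separated EXTRACTED_ENTITY value."""
    merged = []
    for entity in entity_field.split(";"):
        entity = entity.strip()
        if not entity:
            continue
        canonical = merge_map.get(entity, entity)
        # Drop duplicates created by merging while keeping the original order.
        if canonical not in merged:
            merged.append(canonical)
    return "; ".join(merged)


print(normalise_separators("Nike | Puma+Reebok"))
print(merge_variants("Nike Inc; Adidas; Nikes"))
```

Run normalise_separators on the verbatim column before uploading, and merge_variants on the EXTRACTED_ENTITY column of the downloaded CSV.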

Tips for Best Results

Short, specific responses work best – the pipeline is optimised for brand names, product mentions, and single concepts
Include context in the question – if your survey question mentions “brands”, the AI will focus on brand extraction
Review the “Uncodeable” items – they usually indicate blank or gibberish responses, but occasionally contain a valid entity the AI missed
Use mention rank filtering – if your data has grouped columns (brand_1, brand_2, brand_3), the chart will offer rank-based filtering to see first-choice vs second-choice mentions
Combine with Q&A – after extraction, use natural language questions to cross-tabulate entities against demographic variables
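
To make the idea of mention rank concrete, the Python sketch below reshapes grouped columns such as brand_1, brand_2, brand_3 into one row per mention with a rank, then filters to first-choice mentions. It assumes the rank is simply the numeric suffix of the column name. How TruVerbatim detects grouped columns is not described in this guide, and mentions_by_rank is a hypothetical helper.

```python
# Invented rows with grouped columns; empty strings mean no mention given.
rows = [
    {"verbatim_id": "1", "brand_1": "Nike", "brand_2": "Adidas", "brand_3": ""},
    {"verbatim_id": "2", "brand_1": "Adidas", "brand_2": "", "brand_3": ""},
]


def mentions_by_rank(rows, prefix="brand_"):
    """Turn grouped columns into one record per non-empty mention, with rank."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            suffix = column[len(prefix):]
            # Keep only columns like brand_1, brand_2 that hold a value.
            if column.startswith(prefix) and suffix.isdigit() and value.strip():
                long_rows.append(
                    {
                        "verbatim_id": row["verbatim_id"],
                        "rank": int(suffix),
                        "mention": value.strip(),
                    }
                )
    return long_rows


long_rows = mentions_by_rank(rows)
first_choice = [r["mention"] for r in long_rows if r["rank"] == 1]
print(first_choice)
```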