Extraction Pipeline User Guide

How to extract and normalise key terms from short verbatim responses using TruVerbatim’s Entity Extraction pipeline.

Overview

The Extraction Pipeline, shown in the app as Key Term Extraction, is designed for responses that mention specific things – brands, products, features, places, or other named entities. Instead of clustering responses by meaning, the pipeline reads each response, pulls out every entity mentioned, corrects spelling variations, and builds a clean frequency-ranked codeframe.

Each response is assigned:

  • One or more extracted entities (e.g. “Nike”, “Adidas”)
  • A raw mention preserving the original text before normalisation

Step-by-Step Guide

Step 1: Upload Your Data

  1. Open TruVerbatim and sign in
  2. Drag and drop your CSV or Excel file onto the upload area in the chat
  3. Select the column containing your verbatim responses

Optional: Enable auto-cleaning to remove personal information, profanity, duplicates, and blank rows.

Step 2: Review the Recommendation

TruVerbatim analyses your data and presents a triage recommendation. If your responses are short or sparse, the system will recommend Key Term Extraction as the best approach.

You will see:

  • A recommendation card with a confidence score
  • A brief explanation of why extraction was recommended
  • Key data metrics (median response length, short response rate, vocabulary diversity)

If the recommendation shows “Thematic Analysis” but you know your data contains entity mentions, you can override and select “Key Term Extraction” instead.
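
If you want a rough feel for these metrics before uploading, the sketch below computes comparable figures locally. It is only an illustration: the guide does not say how TruVerbatim defines each metric, so the word-based tokenisation, the three-word threshold for a "short" response, and the unique-words-over-total-words measure of vocabulary diversity are all assumptions, and triage_metrics is a hypothetical helper.

```python
import re
from statistics import median

def triage_metrics(responses, short_word_limit=3):
    """Compute illustrative triage metrics for a list of response strings.

    Assumed definitions (not confirmed by TruVerbatim):
    - length is measured in words
    - a response is "short" if it has at most short_word_limit words
    - vocabulary diversity is unique words divided by total words
    """
    # Lower-case before tokenising so "Nike" and "nike" count as one word type.
    tokenised = [re.findall(r"\w+", r.lower()) for r in responses]
    lengths = [len(tokens) for tokens in tokenised]
    all_tokens = [t for tokens in tokenised for t in tokens]
    return {
        "median_response_length": median(lengths) if lengths else 0,
        "short_response_rate": (
            sum(n <= short_word_limit for n in lengths) / len(lengths)
            if lengths else 0.0
        ),
        "vocabulary_diversity": (
            len(set(all_tokens)) / len(all_tokens) if all_tokens else 0.0
        ),
    }

sample = ["Nike", "Nike and Adidas", "I usually buy running shoes from Reebok"]
metrics = triage_metrics(sample)
print(metrics)
```

A short median length combined with a high short response rate is the kind of profile the guide describes as suited to Key Term Extraction.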

Step 3: Select Key Term Extraction

Click the Key Term Extraction pipeline card. The analysis begins immediately.

Step 4: Watch the Analysis Run

Real-time progress updates appear in the chat:

  1. Starting entity extraction pipeline – data is loaded and validated
  2. Understanding your data – the system samples your responses to understand the domain (e.g. brands, products, places)
  3. Extracting entities – the AI processes your responses in batches, with a progress percentage updating as it goes
  4. Building codeframe – extracted entities are counted and ranked by frequency
  5. Generating insight – the AI writes a brief summary of the findings

A progress bar shows the percentage complete throughout.

Step 5: View Your Results

When the extraction completes, the chat displays:

  1. Entity frequency chart – a bar chart showing your extracted entities ranked by how often they were mentioned
  2. AI-generated insight – a narrative summary of the top entities and their distribution
  3. Download button – click to download the full classified CSV

Step 6: Download Your Results

Click the Download CSV button. The exported file includes:

Column | Description
verbatim_id | Row identifier
verbatim_text | The original response
EXTRACTED_ENTITY | All normalised entities found (semicolon-separated)
RAW_MENTION | The original text before normalisation
THEME | Primary entity
Original columns | All metadata from your uploaded file
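
As a minimal sketch of working with the export, the code below reads the file and splits the semicolon-separated EXTRACTED_ENTITY column into a list per response. The column names come from the export columns listed above; the sample rows are invented for illustration, and load_entities is a hypothetical helper, not part of TruVerbatim.

```python
import csv
import io

def load_entities(csv_file):
    """Return (verbatim_id, [entities], theme) tuples from an exported file."""
    rows = []
    for row in csv.DictReader(csv_file):
        # EXTRACTED_ENTITY holds semicolon-separated entities; drop empty pieces.
        entities = [e.strip() for e in row["EXTRACTED_ENTITY"].split(";") if e.strip()]
        rows.append((row["verbatim_id"], entities, row["THEME"]))
    return rows

# Invented sample standing in for a real export (values are illustrative only).
sample_export = io.StringIO(
    "verbatim_id,verbatim_text,EXTRACTED_ENTITY,RAW_MENTION,THEME\n"
    "1,Nikee,Nike,Nikee,Nike\n"
    "2,Nike and Adidas,Nike; Adidas,Nike and Adidas,Nike\n"
)
parsed = load_entities(sample_export)
print(parsed)
```

For a real download, pass an open file instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"), where path points at your exported CSV.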

How It Works

Domain Understanding

Before processing your full dataset, the pipeline samples a selection of your responses to understand the domain. This helps the AI recognise whether responses are about brands, products, cities, features, or something else entirely – so it extracts the right type of entity.

Entity Extraction

For each group of responses, the AI:

  1. Uses the domain context to understand what types of entities to expect
  2. Reads each response carefully
  3. Extracts every named entity mentioned
  4. Splits multi-entity responses (e.g. “Nike and Adidas” becomes two separate entities)
  5. Corrects typos and spelling variations while preserving meaning
  6. Assigns a confidence level (high, medium, or low)
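
The splitting step can be approximated with a simple rule-based sketch. This is not how the pipeline works internally (it uses an AI model with domain context); the code only illustrates the idea of breaking a response on the separators this guide mentions ("/", "&", "and", and commas). split_mentions is a hypothetical name.

```python
import re

# Separators named in this guide: "/", "&", the word "and", and commas.
# \b keeps "and" from matching inside words such as "Brandy".
SEPARATORS = re.compile(r"\s*(?:/|&|,|\band\b)\s*", flags=re.IGNORECASE)

def split_mentions(text):
    """Split a short response into individual candidate entity mentions."""
    return [part.strip() for part in SEPARATORS.split(text) if part.strip()]

print(split_mentions("Nike and Adidas"))
print(split_mentions("Nike/Reebok, Puma & Asics"))
```

A purely rule-based split like this would wrongly break up names that contain a separator, such as "Marks & Spencer", which is one reason domain context matters when deciding where one entity ends and the next begins.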

Normalisation

The AI automatically normalises variations:

  • “Nike”, “nike”, “NIKE”, “Nikee” all become “Nike”
  • “Customer Service”, “customer service”, “cust. service” are unified
  • The original mention is preserved in the RAW_MENTION column so you can always see what the respondent actually typed
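
To make the idea concrete, here is a minimal sketch of normalisation using case folding plus fuzzy string matching from Python's standard library. It is a stand-in technique, not the pipeline's actual method: the AI does not need a predefined list, whereas this sketch assumes a hypothetical CANONICAL list and an arbitrarily chosen similarity cutoff.

```python
from difflib import get_close_matches

# Hypothetical reference list; the real pipeline does not require one.
CANONICAL = ["Nike", "Adidas", "Reebok", "Customer Service"]

def normalise(raw, canonical=CANONICAL, cutoff=0.75):
    """Map a raw mention to its canonical spelling, or return it unchanged."""
    lookup = {name.lower(): name for name in canonical}
    # Fold case and collapse repeated whitespace before comparing.
    key = " ".join(raw.lower().split())
    if key in lookup:
        return lookup[key]
    # Fall back to the closest canonical name above the similarity cutoff.
    close = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else raw.strip()

for mention in ["Nike", "nike", "NIKE", "Nikee", "cust. service", "Puma"]:
    # Keep the raw mention alongside the normalised value, as RAW_MENTION does.
    print(repr(mention), "->", repr(normalise(mention)))
```

Mentions with no sufficiently close match, such as "Puma" here, pass through unchanged rather than being forced onto the nearest name.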

Codeframe Building

Once extraction is complete, the system:

  1. Counts how many times each unique entity appears
  2. Ranks entities by frequency (most mentioned first)
  3. Calculates each entity’s share of total responses as a percentage
  4. Builds a codeframe compatible with the rest of TruVerbatim’s tools (Q&A, sentiment, PowerPoint)
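
The counting, ranking, and percentage steps can be sketched as follows. This is an illustration under stated assumptions: an entity is counted once per response even if mentioned twice, ties are broken alphabetically, and percentages use all responses (including those with no entity) as the base. The guide does not confirm these details, and build_codeframe is a hypothetical helper.

```python
from collections import Counter

def build_codeframe(entities_per_response):
    """Build a frequency-ranked codeframe from one entity list per response."""
    total = len(entities_per_response)
    if total == 0:
        return []
    counts = Counter()
    for entities in entities_per_response:
        # Assumption: count each entity once per response.
        counts.update(set(entities))
    # Most mentioned first; alphabetical order breaks ties (assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"entity": name, "count": n, "percent": round(100 * n / total, 1)}
        for name, n in ranked
    ]

responses = [["Nike", "Adidas"], ["Nike"], ["Reebok"], []]
codeframe = build_codeframe(responses)
for row in codeframe:
    print(row)
```

Because a single response can contribute to several entities, the percentages in such a codeframe can add up to more than 100.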

After the Extraction

Ask Questions

Type questions in the chat to explore your results:

  • “What are the top 10 entities?”
  • “How many people mentioned Nike?”
  • “Show me a crosstab of entities by age group”
  • “Show me the verbatims that mentioned Adidas”

Handling Multi-Entity Responses

The extraction pipeline handles responses that mention multiple entities. For example:

Response | Extracted entities
“Nike and Adidas” | Nike; Adidas
“I bought shoes from Nike/Reebok” | Nike; Reebok

Each entity is counted separately in the frequency chart. The CSV shows all entities in the EXTRACTED_ENTITY column (semicolon-separated) and the primary entity in the THEME column.

Troubleshooting

Issue | Likely cause | Solution
Entities not being split | Unusual separator in responses | The AI handles “/”, “&”, “and”, and commas – other separators may need pre-processing
Too many variations of the same entity | Unusual spellings or abbreviations | The AI normalises most variations, but you can merge in post-processing
“Uncodeable” responses | Blank, gibberish, or truly non-extractable text | These are expected – review them to check they are genuinely non-responses
Unexpected entities extracted | The AI misinterpreted the domain | This can happen with ambiguous data. Try re-running the analysis
Analysis seems slow | Large dataset (5,000+ responses) | Expected – the pipeline processes your data in stages. Progress updates show the current stage
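
For the post-processing merge mentioned above, one possible approach is to apply a manual merge map to the EXTRACTED_ENTITY column of the downloaded CSV. The sketch below handles a single cell; the merge map entries are invented examples and merge_entities is a hypothetical helper, not a TruVerbatim feature.

```python
def merge_entities(cell, merge_map):
    """Apply a manual merge map to one semicolon-separated EXTRACTED_ENTITY cell."""
    merged = []
    for entity in cell.split(";"):
        entity = entity.strip()
        if not entity:
            continue
        canonical = merge_map.get(entity, entity)
        # Skip duplicates that merging can create within one response.
        if canonical not in merged:
            merged.append(canonical)
    return "; ".join(merged)

# Hypothetical variants that were not unified automatically.
MERGE_MAP = {"Nike Inc": "Nike", "Adidas Originals": "Adidas"}

print(merge_entities("Nike Inc; Nike; Adidas Originals", MERGE_MAP))
```

After merging, frequencies need to be recounted, since two previously separate entities now share one label.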

Tips for Best Results

  • Short, specific responses work best – the pipeline is optimised for brand names, product mentions, and single concepts
  • Include context in the question – if your survey question mentions “brands”, the AI will focus on brand extraction
  • Review the “Uncodeable” items – they usually indicate blank or gibberish responses, but occasionally contain a valid entity the AI missed
  • Use mention rank filtering – if your data has grouped columns (brand_1, brand_2, brand_3), the chart will offer rank-based filtering to see first-choice vs second-choice mentions
  • Combine with Q&A – after extraction, use natural language questions to cross-tabulate entities against demographic variables
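
If you prefer to cross-tabulate offline from the downloaded CSV, a minimal sketch looks like this. The age_group column is a hypothetical metadata column carried over from your upload, the sample rows are invented, and crosstab is an illustrative helper rather than part of the product.

```python
from collections import Counter

def crosstab(rows, group_column):
    """Count entity mentions per value of a demographic column."""
    table = Counter()
    for row in rows:
        for entity in row["EXTRACTED_ENTITY"].split(";"):
            entity = entity.strip()
            if entity:
                table[(entity, row[group_column])] += 1
    return table

# Invented rows in the shape csv.DictReader would produce from an export.
rows = [
    {"EXTRACTED_ENTITY": "Nike; Adidas", "age_group": "18-24"},
    {"EXTRACTED_ENTITY": "Nike", "age_group": "18-24"},
    {"EXTRACTED_ENTITY": "Reebok", "age_group": "25-34"},
]
table = crosstab(rows, "age_group")
for (entity, group), count in sorted(table.items()):
    print(entity, group, count)
```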
