Quality Assurance

FAQ: How do you assure the quality of your categorization results?

The quality of categorization is measured separately for every single category, as some categories are more important than others (e.g. salary detection is much more difficult, but also much more important for our customers, than detecting payments in supermarkets).

For each category we measure the classical precision and recall KPIs and compute the F1 score, the harmonic mean of precision and recall. Roughly speaking, recall measures the capability of the machine to detect the desired category (“how many of the expected category occurrences are detected?”), while precision measures how many of the detected occurrences are actually correct.
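For illustration, here is a minimal sketch of how the three KPIs are derived from per-category counts of true positives (tp), false positives (fp) and false negatives (fn); the function and its inputs are assumptions for the example, not our actual tooling:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Precision, recall and their harmonic mean (F1) from per-category counts."""
    if tp == 0:
        return 0.0              # no correct detections: precision and recall are 0
    precision = tp / (tp + fp)  # share of detected occurrences that are correct
    recall = tp / (tp + fn)     # share of expected occurrences that are detected
    return 2 * precision * recall / (precision + recall)
```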

For instance, typical F1 scores in January 2020 were:

| Category | Description | F1 Score |
| --- | --- | --- |
| A.5.4 | Payments to collection offices | 98.7% |
| E.1.1 | Salary | 98.9% |
| K.2.1 | Rent | 92.0% |

Fintecsystems has implemented a quality assurance process comprising the following 9 steps.

1 Immediate Support

Every customer can send feedback directly to the categorization team, which takes care of the request. Most requests are explanatory in nature; some are problem reports that the team addresses directly.

2 Random sample testing

Our categorization team performs random-sample checks of categorized turnovers and records the desired outcome. Several employees have to come to the same conclusion to settle a manual categorization decision (majority decision), as sketched below. This process covers thousands of account turnovers every month, and the results are processed in a 30-day round-robin cycle.
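To illustrate the majority rule, a minimal sketch; the function name and the vote format are assumptions for the example:

```python
from collections import Counter

def settle_decision(votes: list[str]) -> str | None:
    """Settle a manual categorization if a strict majority of reviewers agree.

    `votes` holds the category code chosen by each reviewer, e.g.
    ["E.1.1", "E.1.1", "K.2.1"] settles to "E.1.1"; no majority settles nothing.
    """
    category, count = Counter(votes).most_common(1)[0]
    return category if count > len(votes) / 2 else None
```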

3 Computing statistics

The manual majority decisions are used to compute recall, precision and the F1 score for every category. This helps us measure the quality of our engine and steer our work.
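A hedged sketch of how these per-category KPIs could be derived from the settled majority decisions, reusing the f1_score sketch above; the parallel label lists gold (majority decisions) and predicted (engine output) are illustrative assumptions:

```python
from collections import defaultdict

def category_kpis(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Per-category F1 scores from parallel lists of gold and engine labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for g, p in zip(gold, predicted):
        if g == p:
            counts[g]["tp"] += 1   # engine and reviewers agree
        else:
            counts[p]["fp"] += 1   # engine claimed p incorrectly
            counts[g]["fn"] += 1   # engine missed the expected category g
    return {cat: f1_score(c["tp"], c["fp"], c["fn"]) for cat, c in counts.items()}
```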

4 Four-eyes principle

As in our software development department, changes to the categorization logic cannot be made by one person alone. A second expert has to review the changes to the categorization engine and accept them as useful and valid.

5 Automated tests

Every change to the categorization engine has to pass an automated test suite. If these tests fail, the change cannot go into production.
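A minimal sketch of what such a gating test could look like; the categorize() entry point, the module name and the fixtures are hypothetical:

```python
import pytest

from categorization_engine import categorize  # hypothetical engine entry point

# Illustrative regression fixtures: (booking text, expected category code).
KNOWN_CASES = [
    ("GEHALT/LOHN ACME GMBH", "E.1.1"),  # salary
    ("MIETE MUSTERSTRASSE 1", "K.2.1"),  # rent
]

@pytest.mark.parametrize("text,expected", KNOWN_CASES)
def test_known_turnovers_keep_their_category(text, expected):
    assert categorize(text) == expected
```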

6 Before-after comparison

Every change to the categorization engine is additionally checked by a team member in a before-after comparison: the “before” engine and the “after” engine are run on a large sample of real account turnovers, and any differences in the results are displayed for the expert to review.
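A sketch of that comparison step, with before and after standing in for the two engine versions as callables (names are illustrative):

```python
def diff_engines(turnovers, before, after):
    """Return the turnovers whose category changes between two engine versions."""
    changed = []
    for turnover in turnovers:
        old, new = before(turnover), after(turnover)
        if old != new:
            changed.append((turnover, old, new))  # shown to the expert for review
    return changed
```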

7 Versioning

The categorization engine is strictly versioned, like any other piece of software we produce. Should a problem appear in a new version, we can return to a previous version at any time, within seconds.
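Conceptually, such an instant rollback can be pictured as keeping several engine versions loaded and flipping an active-version pointer; this registry is purely illustrative, not a description of our actual deployment mechanics:

```python
class EngineRegistry:
    """Keep several engine versions loaded so a rollback is just a pointer flip."""

    def __init__(self):
        self._versions = {}  # version string -> engine instance
        self._active = None

    def register(self, version, engine):
        self._versions[version] = engine

    def activate(self, version):
        self._active = version  # rolling back = re-activating an earlier version

    def categorize(self, turnover):
        return self._versions[self._active].categorize(turnover)
```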

8 Problem case discussion

The categorization team holds weekly or bi-weekly calls to discuss the handling of edge and problem cases. Every edge or problem case is documented.

9 Team agreement measurement

All manual categorization decisions of all team members are monitored so that we can detect deviations from the “decision norm”. Comparing a vast number of manual decisions made by several team members reveals individual misunderstandings or training needs.
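One hedged way to quantify such deviations, assuming each settled turnover stores the majority label together with every reviewer's individual vote (the data layout is an assumption for the example):

```python
from collections import defaultdict

def agreement_rates(decisions):
    """Share of each reviewer's votes that match the settled majority label.

    `decisions` is an iterable of (majority_label, {reviewer: vote}) pairs;
    a reviewer with a noticeably low rate signals a misunderstanding or a
    training need.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for majority, votes in decisions:
        for reviewer, vote in votes.items():
            totals[reviewer] += 1
            hits[reviewer] += int(vote == majority)
    return {reviewer: hits[reviewer] / totals[reviewer] for reviewer in totals}
```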
