
Why are different tokenizers used in the RAG? Are empty strings in the token list meaningful when searching Tantivy?

I noticed that different tokenizers are used in the RAG, and I'm wondering why. Also, do the empty strings in the token list have any effect when searching Tantivy?


Ryan Y

Asked on Jan 29, 2024

Different tokenizers are used in the RAG to handle different types of input data. For example, one tokenizer may be used for text data and another for numeric data. The empty strings in the token list are not meaningful when searching Tantivy, and you can remove them to make debugging easier.
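As a rough illustration of removing the empty strings, here is a minimal Python sketch. The helper name `drop_empty_tokens` and the sample token list are hypothetical and not taken from the RAG code, and treating whitespace-only strings as empty is an added assumption.

```python
def drop_empty_tokens(tokens):
    """Return the tokens with empty strings removed.

    Whitespace-only strings are dropped as well (an assumption made
    for this sketch, not something stated in the thread).
    """
    return [t for t in tokens if t.strip()]


# Hypothetical tokenizer output containing empty strings.
tokens = ["different", "", "tokenizers", " ", "rag", ""]
cleaned = drop_empty_tokens(tokens)
print(cleaned)
```

Filtering right after tokenization keeps the list you inspect while debugging free of entries that carry no search meaning.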

Jan 29, 2024