Newest 'tokenize' Questions

Advice

0 votes

4 replies

109 views

Removing and inserting an element into a vector after doing another operation with it

I am creating a math parser using a tokenizing system that turns the user-entered expression into postfix (Reverse Polish Notation, RPN). I have set up the function to turn user-entered input into RPN, then I ...

Chetan Poudel

1

asked Apr 15 at 16:24

13 votes

6 answers

1k views

Splitting a string into tokens with several possible separators, using `std::ranges`

My goal is to split a std::string into tokens delimited by a list of possible delimiters/separators. For instance std::string line{"\tSplit \t\t this sequence\t of tokens "}; must be ...

Oersted

5,187

asked Feb 4 at 17:10

Advice

0 votes

0 replies

53 views

Using Langchain's MarkdownTextSplitter with a tokenizer

I'm not sure how to chunk a Markdown file with Langchain's MarkdownTextSplitter while at the same time making sure the chunks don't overflow the maximum token size for the LLM we will be using. As far as I ...

michielve

579

asked Jan 8 at 13:19

Advice

0 votes

0 replies

121 views

Does OpenAI API TPM limit count input tokens, output tokens, or both?

I’m a bit confused about how OpenAI’s API rate limits work - specifically the TPM (tokens per minute) limit. If I have, for example, 2 million TPM, is that limit calculated based on: only the input ...

Adabler

34

asked Nov 23, 2025 at 12:12

1 vote

2 answers

664 views

How can I match the token count used by BGE-M3 embedding model before embedding?

For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model would break a string down into before I embed the text. I could embed the string and count the ...

ManBearPigeon

13

asked Sep 2, 2025 at 18:38

1 vote

0 answers

223 views

Convert SentencePiece tokenizer to ONNX

I'm developing (Python) an FAQ system based on embeddings to perform similarity queries between a user's question and the FAQ knowledge base. The FAQ needs to run on Android smartphones. I'm using the ...

ltu

177

asked Aug 27, 2025 at 11:17

1 vote

2 answers

152 views

Strtok retains old data

I am currently writing a shell after taking a bit of a break from C, and I have found this problem with strtok. If I were to write "cd ../" on one line and then "ls" on the next it ...

The QNX girl

45

asked Aug 1, 2025 at 15:00

0 votes

1 answer

60 views

Efficient multi-host TPU dataset processing

I want to train an LLM on TPUv4-32 using JAX/Flax. The dataset is stored in a mounted Google storage bucket. The dataset (Red-Pajama-v2) consists of 5000 shards, which are stored in .json.gz files: ~/...

innerproduct

3

asked Jul 10, 2025 at 21:35

2 votes

1 answer

1k views

How to properly save and load a PEFT-trained Unsloth model with resized token embeddings?

I'm using Unsloth's FastVisionModel with the base model unsloth/qwen2-VL-2B-Instruct to train on a dataset that includes text with many unique characters. Here's the overall process I followed: ...

GauravGiri

21

asked May 22, 2025 at 5:07

1 vote

1 answer

133 views

OpenAI GPT-3 token logprobs and word-level surprisal: inconsistent values and missing outputs for multi-token targets

I’m trying to compute word-level surprisal values for a set of sentence stimuli using OpenAI’s Completions API (legacy endpoint). In information-theoretic terms, surprisal is the negative base-2 ...

Odysseus Myresiotis Alivertis

11

asked May 20, 2025 at 15:08
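For the surprisal question above, a minimal sketch of the arithmetic may help. It assumes the API reports token log probabilities as natural logarithms, and that a word split into several tokens gets the sum of its token surprisals (the chain rule). The function names and the example values are hypothetical and are not taken from the question.

```python
import math


def token_surprisal_bits(logprob_nats: float) -> float:
    """Convert one natural-log probability into surprisal measured in bits."""
    # Surprisal is -log2(p); dividing ln(p) by ln(2) changes the base to 2.
    return -logprob_nats / math.log(2)


def word_surprisal_bits(token_logprobs_nats) -> float:
    """Sum token surprisals to get the surprisal of a multi-token word."""
    return sum(token_surprisal_bits(lp) for lp in token_logprobs_nats)


# Hypothetical log probabilities for a word that was split into two tokens.
example_logprobs = [math.log(0.5), math.log(0.25)]
print(word_surprisal_bits(example_logprobs))
```

If a target word is missing from the returned log probabilities, this sum cannot be formed, so any such gap has to be handled before calling the helper.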

2 votes

0 answers

51 views

How to get an exact substring match search with wildcards for Solr in ColdFusion?

I am trying to implement a search in ColdFusion (with indexing through Solr) where it gets a match on exact substrings and exact substrings only. Here's my sample code: <cfset criteriaString = '*#...

jadedQuail

127

asked Mar 13, 2025 at 21:34

1 vote

1 answer

125 views

How to handle German language specific characters like (ä, ö, ü, ß) while tokenizing using GPT2Tokenizer?

I am working with German texts that I need to tokenize using GPT2Tokenizer. To tokenize the text, I wrote the implementation as follows: from transformers import GPT2Tokenizer text = "...

RajibTheKing

1,372

asked Mar 3, 2025 at 22:32

0 votes

2 answers

259 views

Fixing Missing NLTK Tokenizer Resources

Repeated lookup error even though NLTK is downloaded: Resource punkt_tab not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk nltk.download('...

Ellster

1

asked Feb 27, 2025 at 21:00

1 vote

1 answer

100 views

How do I remove escape characters from output of nltk.word_tokenize?

How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...

green_ruby

51

asked Feb 18, 2025 at 20:10

0 votes

0 answers

70 views

PunktTokenizer does not work with Russian `я.`

When tokenizing paragraphs into sentences in the Russian language, I am observing a special case where the sequence is not treated as the end of the sentence. The case is with the я. at the end of the ...

pepr

21.2k

asked Feb 3, 2025 at 9:12
