3,012 questions
Advice
0
votes
4
replies
109
views
Removing and inserting an element into a vector after doing another operation with it
I am creating a math parser using a tokenizing system, turning the user-entered expression into Postfix/Reverse Polish Notation (RPN). I have set up the function to turn user-entered input into RPN, then I ...
13
votes
6
answers
1k
views
Splitting a string into tokens with several possible separators, using `std::ranges`
My goal is to split a std::string into tokens delimited by a list of possible delimiters/separators.
For instance std::string line{"\tSplit \t\t this sequence\t of tokens "}; must be ...
Advice
0
votes
0
replies
53
views
Using Langchain's MarkdownTextSplitter with a tokenizer
I'm not sure how to chunk a Markdown file with Langchain's MarkdownTextSplitter while at the same time making sure the chunks don't overflow the maximum token size for the LLM we will be using. As far as I ...
Advice
0
votes
0
replies
121
views
Does OpenAI API TPM limit count input tokens, output tokens, or both?
I’m a bit confused about how OpenAI’s API rate limits work - specifically the TPM (tokens per minute) limit.
If I have, for example, 2 million TPM, is that limit calculated based on:
only the input ...
1
vote
2
answers
664
views
How can I match the token count used by BGE-M3 embedding model before embedding?
For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model would break a string down into before I embed the text. I could embed the string and count the ...
1
vote
0
answers
223
views
Convert SentencePiece tokenizer to ONNX
I'm developing (Python) an FAQ system based on embeddings to perform similarity queries between a user's question and the FAQ knowledge base. The FAQ needs to run on Android smartphones.
I'm using the ...
1
vote
2
answers
152
views
Strtok retains old data
I am currently writing a shell after taking a bit of a break from C, and I have found this problem with strtok. If I were to write "cd ../" on one line and then "ls" on the next it ...
0
votes
1
answer
60
views
Efficient multi-host TPU dataset processing
I want to train an LLM on TPUv4-32 using JAX/Flax. The dataset is stored in a mounted Google storage bucket. The dataset (Red-Pajama-v2) consists of 5000 shards, which are stored in .json.gz files: ~/...
2
votes
1
answer
1k
views
How to properly save and load a PEFT-trained Unsloth model with resized token embeddings?
I'm using Unsloth's FastVisionModel with the base model unsloth/qwen2-VL-2B-Instruct to train on a dataset that includes text with many unique characters. Here's the overall process I followed:
...
1
vote
1
answer
133
views
OpenAI GPT-3 token logprobs and word-level surprisal: inconsistent values and missing outputs for multi-token targets
I’m trying to compute word-level surprisal values for a set of sentence stimuli using OpenAI’s Completions API (legacy endpoint).
In information-theoretic terms, surprisal is the negative base-2 ...
2
votes
0
answers
51
views
How to get an exact substring match search with wildcards for Solr in ColdFusion?
I am trying to implement a search in ColdFusion (with indexing through Solr) where it gets a match on exact substrings and exact substrings only.
Here's my sample code:
<cfset criteriaString = '*#...
1
vote
1
answer
125
views
How to handle German language specific characters like (ä, ö, ü, ß) while tokenizing using GPT2Tokenizer?
I am working with German texts, which I need to tokenize using GPT2Tokenizer.
To tokenize the text, I wrote the implementation as follows:
from transformers import GPT2Tokenizer
text = "...
0
votes
2
answers
259
views
Fixing Missing NLTK Tokenizer Resources
Repeated lookup error even though NLTK is downloaded:
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
nltk.download('...
1
vote
1
answer
100
views
How do I remove escape characters from output of nltk.word_tokenize?
How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...
0
votes
0
answers
70
views
PunktTokenizer does not work with Russian `я.`
When tokenizing paragraphs into sentences in the Russian language, I am observing a special case in which the sequence is not treated as the end of the sentence. The case is with я. at the end of the ...