Advice
0 votes
4 replies
109 views

I am creating a math parser using a tokenizing system, turning the user-entered expression into postfix (Reverse Polish) notation. I have written the function that turns user-entered input into RPN, then I ...
Chetan Poudel's user avatar
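
The excerpt does not show how the asker converts tokens to RPN. One common approach is the shunting-yard algorithm; the following is a minimal sketch of it, not the asker's code. The operator table, the `to_rpn` name, and the assumption that the input is already a list of string tokens are all illustrative.

```python
# Assumed operator set: precedence levels and right-associative operators.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOC = {"^"}


def to_rpn(tokens):
    """Convert a list of infix tokens to postfix order (shunting-yard)."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators that bind tighter, or equally tight when the
            # incoming operator is left-associative.
            while stack and stack[-1] in PRECEDENCE and (
                PRECEDENCE[stack[-1]] > PRECEDENCE[tok]
                or (PRECEDENCE[stack[-1]] == PRECEDENCE[tok]
                    and tok not in RIGHT_ASSOC)
            ):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            # Unwind to the matching opening parenthesis.
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("mismatched parentheses")
            stack.pop()  # discard the "("
        else:
            # Anything else is treated as an operand (number or name).
            output.append(tok)
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("mismatched parentheses")
        output.append(op)
    return output
```

For example, `to_rpn(["3", "+", "4", "*", "2"])` yields `["3", "4", "2", "*", "+"]`, because `*` binds tighter than `+`.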
13 votes
6 answers
1k views

My goal is to split a std::string into tokens delimited by a list of possible delimiters/separators. For instance std::string line{"\tSplit \t\t this sequence\t of tokens "}; must be ...
Oersted's user avatar
  • 5,187
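
The question above is about C++ `std::string`, and its expected output is cut off in the excerpt. As a language-neutral illustration of the idea only, here is a Python sketch that splits on any run of delimiter characters. The `split_tokens` name and the choice to drop empty tokens are assumptions, since the excerpt does not say how empty fields should be handled.

```python
import re


def split_tokens(line, delimiters=" \t"):
    """Split `line` on runs of any character in `delimiters`.

    Assumption: consecutive delimiters count as one separator and
    empty tokens at the start or end are dropped.
    """
    # Build a character class such as [\ \t]+ from the delimiter set.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, line) if tok]
```

With the sample line from the question, `split_tokens("\tSplit \t\t this sequence\t of tokens ")` gives `["Split", "this", "sequence", "of", "tokens"]`.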
Advice
0 votes
0 replies
53 views

I'm not sure how to chunk a Markdown file with Langchain's MarkdownTextSplitter while at the same time making sure the chunks don't overflow the maximum token size for the LLM we will be using. As far as I ...
michielve's user avatar
  • 579
Advice
0 votes
0 replies
121 views

I’m a bit confused about how OpenAI’s API rate limits work - specifically the TPM (tokens per minute) limit. If I have, for example, 2 million TPM, is that limit calculated based on: only the input ...
Adabler's user avatar
  • 34
1 vote
2 answers
664 views

For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model would break a string down into before I embed the text. I could embed the string and count the ...
ManBearPigeon's user avatar
1 vote
0 answers
223 views

I'm developing (Python) an FAQ system based on embeddings to perform similarity queries between a user's question and the FAQ knowledge base. The FAQ needs to run on Android smartphones. I'm using the ...
ltu's user avatar
  • 177
1 vote
2 answers
152 views

I am currently writing a shell after taking a bit of a break from C, and I have found this problem with strtok. If I were to write "cd ../" on one line and then "ls" on the next it ...
The QNX girl's user avatar
0 votes
1 answer
60 views

I want to train an LLM on TPUv4-32 using JAX/Flax. The dataset is stored in a mounted Google storage bucket. The dataset (Red-Pajama-v2) consists of 5000 shards, which are stored in .json.gz files: ~/...
innerproduct's user avatar
2 votes
1 answer
1k views

I'm using Unsloth's FastVisionModel with the base model unsloth/qwen2-VL-2B-Instruct to train on a dataset that includes text with many unique characters. Here's the overall process I followed: ...
GauravGiri's user avatar
1 vote
1 answer
133 views

I’m trying to compute word-level surprisal values for a set of sentence stimuli using OpenAI’s Completions API (legacy endpoint). In information-theoretic terms, surprisal is the negative base-2 ...
Odysseus Myresiotis Alivertis's user avatar
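
The surprisal definition mentioned above, the negative base-2 logarithm of a probability, can be sketched as follows. This is an illustration under stated assumptions, not the asker's code: it assumes the log-probabilities are natural logarithms (check this against the API you use), and that a word's surprisal is the sum of the surprisals of its sub-word tokens, which follows from the chain rule. The function names are hypothetical.

```python
import math


def surprisal_bits(logprob_nats):
    """Convert a natural-log probability to surprisal in bits."""
    # -log2(p) = -ln(p) / ln(2)
    return -logprob_nats / math.log(2)


def word_surprisal(token_logprobs):
    """Sum the per-token surprisals of the tokens that make up one word."""
    return sum(surprisal_bits(lp) for lp in token_logprobs)
```

A token with probability 0.5 has a surprisal of 1 bit; a word split into two tokens with probabilities 0.5 and 0.25 has a surprisal of 1 + 2 = 3 bits.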
2 votes
0 answers
51 views

I am trying to implement a search in ColdFusion (with indexing through Solr) where it gets a match on exact substrings and exact substrings only. Here's my sample code: <cfset criteriaString = '*#...
jadedQuail's user avatar
1 vote
1 answer
125 views

I am working with German texts, which I need to tokenize using GPT2Tokenizer. To tokenize the text, I wrote the implementation as follows: from transformers import GPT2Tokenizer text = "...
RajibTheKing's user avatar
  • 1,372
0 votes
2 answers
259 views

Repeated lookup error even though NLTK is downloaded: Resource punkt_tab not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk nltk.download('...
Ellster's user avatar
1 vote
1 answer
100 views

How do I get rid of non-printing (escaped) characters from the output of the nltk.word_tokenize method? I am working through the book 'Natural Language Processing with Python' and am following the ...
green_ruby's user avatar
0 votes
0 answers
70 views

When tokenizing paragraphs into sentences in the Russian language, I am observing a special case in which a sequence is not treated as the end of the sentence. The case is with the я. at the end of the ...
pepr's user avatar
  • 21.2k
