The Wayback Machine - https://web.archive.org/web/20201210063336/https://github.com/github/CodeSearchNet/issues/106
php with utf-8 tokenization are broken #106

Closed
Savier opened this issue Jan 31, 2020 · 1 comment

Savier commented Jan 31, 2020

I tried to use php_dedupe_definitions_v2.pkl for my own project and found many functions with broken tokenization. For example, searching for functions that contain empty ('') tokens turns up more than 8000 of them. And if we look at all 1-letter tokens, we get tons of 1-letter UTF-8 tokens, which should not be possible.
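A rough sketch of the kind of check described above, for anyone who wants to reproduce the counts. It assumes the pickle has already been loaded into a list of dicts and that the token list lives under a key such as `code_tokens`; the key name and the `audit_tokens` helper are assumptions for illustration, not part of the dataset tooling, so adjust `token_field` to whatever the records actually contain.

```python
from collections import Counter


def audit_tokens(records, token_field="code_tokens"):
    """Count records containing an empty token and tally 1-character non-ASCII tokens.

    `records` is any iterable of dicts; `token_field` is an assumed key name.
    """
    with_empty = 0
    single_non_ascii = Counter()
    for rec in records:
        tokens = rec.get(token_field) or []
        # A record counts once, no matter how many empty tokens it holds.
        if any(tok == "" for tok in tokens):
            with_empty += 1
        for tok in tokens:
            # Single-character tokens outside ASCII hint at split UTF-8 identifiers.
            if len(tok) == 1 and not tok.isascii():
                single_non_ascii[tok] += 1
    return with_empty, single_non_ascii


# Tiny hand-made sample standing in for the real records.
sample = [
    {"code_tokens": ["function", "", "foo", "(", ")"]},
    {"code_tokens": ["$", "п", "р", "и"]},
    {"code_tokens": ["return", "$x", ";"]},
]
with_empty, singles = audit_tokens(sample)
```

On the real file, `with_empty` would be the number of affected functions and `singles.most_common()` would list the most frequent suspicious tokens.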

hamelsmu (Member) commented Feb 9, 2020

Thanks @Savier, perhaps the code field can be used instead of the code_tokens field so you can do your own tokenization in the meantime? Thank you for letting us know about this issue!
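As a minimal sketch of that workaround, the raw source in the code field could be re-tokenized with a Unicode-aware regular expression. This is not the tokenizer the project uses and not a real PHP lexer (string literals and comments are split like any other text); the `simple_tokenize` name and the pattern are assumptions chosen only to keep non-ASCII identifiers in one piece.

```python
import re

# Python 3 `\w` matches Unicode word characters for str patterns, so
# identifiers such as `$имя` stay whole. Every alternative consumes at
# least one character, so no empty tokens can be produced.
TOKEN_RE = re.compile(r"\$?\w+|[^\w\s]")


def simple_tokenize(code):
    """Split source text into identifier/number tokens and single punctuation marks."""
    return TOKEN_RE.findall(code)


tokens = simple_tokenize("function привет($имя) { return $имя; }")
```

Applied per record, this would look like `simple_tokenize(rec["code"])`, where the `code` key is the field named in the comment above.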

hamelsmu added the bug label Feb 9, 2020
hamelsmu closed this Sep 4, 2020
2 participants