lexichash

Version: v0.5.3 (latest)
Published: Jun 2, 2026
License: MIT
Imports: 14
Imported by: 1

Repository

github.com/shenwei356/lexichash

README ¶

lexichash

This project implements LexicHash in Golang, with high performance and a low memory footprint.

This package is used in LexicMap.
Bit-packed k-mer operations are provided by kmers.

Support

Please open an issue to report bugs, propose new functions or ask for help.

License

MIT License

Documentation ¶

Index ¶

Variables
func Hash64(key uint64) uint64
func IsLowComplexity(code uint64, k int) bool
func MustDecode(code uint64, k uint8) []byte
func MustDecoder() func(code uint64, k uint8) []byte
type LexicHash

Constants ¶

This section is empty.

Variables ¶

var ErrBrokenFile = errors.New("lexichash: broken file")

ErrBrokenFile means the file is not complete.

var ErrInsufficientMasks = errors.New("lexichash: insufficient masks (should be >=64)")

ErrInsufficientMasks means the number of masks is too small.

var ErrInvalidFileFormat = errors.New("lexichash: invalid binary format")

ErrInvalidFileFormat means invalid file format.

var ErrKOverflow = errors.New("lexichash: k-mer size overflow, valid range is [3-32]")

ErrKOverflow means K > 32.

var ErrPrefixOverflow = errors.New("lexichash: prefix should be in range of [3, k]")

ErrPrefixOverflow means prefix > k.

var ErrVersionMismatch = errors.New("lexichash: version mismatch")

ErrVersionMismatch means version mismatch between files and program.

var Magic = [8]byte{'k', 'm', 'e', 'r', 'm', 'a', 's', 'k'}

var MainVersion uint8 = 0

var MinorVersion uint8 = 1

var Strands = [2]byte{'+', '-'}

Strands maps a reverse-complement flag to its strand symbol: Strands[0] is '+' and Strands[1] is '-'.
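The Mask* methods documented below return positions whose lowest bit is the strand flag. A minimal sketch of unpacking such a value and looking up the strand symbol follows. The helper name splitLoc and the local copy of the Strands array are illustrative only and not part of the package.

```go
package main

import "fmt"

// strands mirrors the package-level Strands variable.
var strands = [2]byte{'+', '-'}

// splitLoc is a hypothetical helper: it splits a packed location into the
// 0-based start position and the strand flag stored in the lowest bit.
func splitLoc(loc int) (pos int, rc int) {
	return loc >> 1, loc & 1
}

func main() {
	// Position 100 on the negative strand, packed as (100 << 1) | 1.
	pos, rc := splitLoc(100<<1 | 1)
	fmt.Printf("%d %c\n", pos, strands[rc]) // prints "100 -"
}
```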

Functions ¶

func Hash64 ¶ added in v0.2.0

func Hash64(key uint64) uint64

Hash64 hashes a 64-bit integer key. Reference: https://gist.github.com/badboy/6267743. A version with mask: https://gist.github.com/lh3/974ced188be2f90422cc.
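For orientation, here is a sketch of Thomas Wang's 64-bit integer mix, which is the function the first link is assumed to describe. Whether Hash64 uses exactly these steps is an assumption; the name wangHash64 is illustrative.

```go
package main

import "fmt"

// wangHash64 sketches Thomas Wang's 64-bit integer mix function.
// ASSUMPTION: this may differ in detail from the package's Hash64.
func wangHash64(key uint64) uint64 {
	key = ^key + (key << 21)
	key ^= key >> 24
	key = key + (key << 3) + (key << 8) // key * 265
	key ^= key >> 14
	key = key + (key << 2) + (key << 4) // key * 21
	key ^= key >> 28
	key += key << 31
	return key
}

func main() {
	for _, k := range []uint64{0, 1, 2} {
		fmt.Printf("%d -> %#016x\n", k, wangHash64(k))
	}
}
```

Every step is invertible, so distinct keys always produce distinct hashes.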

func IsLowComplexity ¶

func IsLowComplexity(code uint64, k int) bool

IsLowComplexity checks if a k-mer is of low-complexity.

func MustDecoder ¶

func MustDecoder() func(code uint64, k uint8) []byte

MustDecoder returns a decode function that reuses its byte slice between calls, so copy the result if you need to keep it.
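The buffer-reuse behaviour can be illustrated with a small closure. This is a sketch, not the package's code: newDecoder is a hypothetical name, and the 2-bit encoding (A=00, C=01, G=10, T=11) is an assumption.

```go
package main

import "fmt"

// newDecoder returns a decode function that reuses one internal buffer.
// ASSUMPTION: 2 bits per base with A=00, C=01, G=10, T=11; k must be <= 32.
func newDecoder() func(code uint64, k uint8) []byte {
	buf := make([]byte, 32)
	const bases = "ACGT"
	return func(code uint64, k uint8) []byte {
		out := buf[:k]
		// Fill from the last base to the first, two bits at a time.
		for i := int(k) - 1; i >= 0; i-- {
			out[i] = bases[code&3]
			code >>= 2
		}
		return out
	}
}

func main() {
	decode := newDecoder()
	first := decode(0b00011011, 4)
	fmt.Println(string(first)) // ACGT
	decode(0b11111111, 4)
	fmt.Println(string(first)) // TTTT: the earlier result was overwritten
}
```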

Types ¶

type LexicHash ¶

type LexicHash struct {
	K int // max length of shared substrings, should be in range of [4, 31]

	Seed  int64    // seed for generating masks
	Masks []uint64 // masks/k-mers
	// contains filtered or unexported fields
}

LexicHash is for finding shared substrings between nucleotide sequences.
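As a rough illustration of the idea, the sketch below picks, for one mask, the k-mer that shares the longest prefix with it. It assumes k-mers are packed with 2 bits per base, in which case the smallest XOR against the mask corresponds to the longest shared prefix. The function names are illustrative and this is not the package's implementation.

```go
package main

import (
	"fmt"
	"math/bits"
)

// sharedPrefixLen returns how many leading bases two k-mers share.
// Both are assumed to be encoded with 2 bits per base, with k <= 32.
func sharedPrefixLen(a, b uint64, k int) int {
	x := a ^ b
	if x == 0 {
		return k
	}
	return (bits.LeadingZeros64(x) - (64 - 2*k)) / 2
}

// bestMatch returns the index of the k-mer with the smallest XOR against
// mask, i.e. one that shares the longest prefix with the mask.
func bestMatch(mask uint64, kmers []uint64) int {
	best := 0
	for i, km := range kmers {
		if km^mask < kmers[best]^mask {
			best = i
		}
	}
	return best
}

func main() {
	mask := uint64(0b00011011)                            // ACGT
	kmers := []uint64{0b11011011, 0b00101011, 0b00011000} // TCGT, AGGT, ACGA
	i := bestMatch(mask, kmers)
	fmt.Println(i, sharedPrefixLen(mask, kmers[i], 4)) // 2 3
}
```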

func New ¶

func New(k int, nMasks int, p int) (*LexicHash, error)

New returns a new LexicHash object. nMasks must be >= 64, preferably >= 1024, and preferably a power of 4, i.e., 64, 256, 1024, 4096 ... p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.
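A caller that wants to follow the power-of-4 recommendation could check nMasks up front. The helper below is a hypothetical sketch and not part of the package.

```go
package main

import "fmt"

// isPowerOf4 reports whether n is a power of 4: exactly one bit is set,
// and that bit sits at an even position (mask 0x55555555).
// The mask covers values up to 4^15, which is ample for mask counts.
func isPowerOf4(n int) bool {
	return n > 0 && n&(n-1) == 0 && n&0x55555555 != 0
}

func main() {
	for _, n := range []int{64, 128, 256, 1000, 1024, 4096} {
		fmt.Println(n, isPowerOf4(n))
	}
}
```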

func NewFromFile ¶

func NewFromFile(file string) (*LexicHash, error)

NewFromFile creates a LexicHash from a binary file.

func NewFromTextFile ¶ added in v0.3.0

func NewFromTextFile(file string) (*LexicHash, error)

NewFromTextFile creates a new LexicHash object with custom k-mers read from a text file.

func NewWithMasks ¶ added in v0.3.0

func NewWithMasks(k int, masks []uint64) (*LexicHash, error)

NewWithMasks creates a new LexicHash object with custom k-mers. The number of masks must be >= 64, preferably >= 1024, and preferably a power of 4, i.e., 64, 256, 1024, 4096 ...

func NewWithSeed ¶

func NewWithSeed(k int, nMasks int, randSeed int64, p int) (*LexicHash, error)

NewWithSeed creates a new LexicHash object with the given seed. nMasks must be >= 64, preferably >= 1024, and preferably a power of 4, i.e., 64, 256, 1024, 4096 ... p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.

func Read ¶

func Read(r io.Reader) (*LexicHash, error)

Read reads a LexicHash from an io.Reader.

func (*LexicHash) IndexMasks ¶ added in v0.4.0

func (lh *LexicHash) IndexMasks(p int) error

IndexMasks creates a lookup table (a slice) indexed by the p-bp prefixes of the masks. Afterwards you can use MaskKnownPrefixes() to mask only k-mers whose prefixes exist in the table. The prefix length p can't be too large, or the lookup table would occupy a lot of space.
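To see why p must stay small, consider a sketch of such a table: with 2 bits per base, a slice indexed by p-bp prefixes needs 4^p slots. The function below is illustrative only (hypothetical name, assumed 2-bit encoding) and does not reproduce the package's internal layout.

```go
package main

import "fmt"

// buildPrefixIndex groups mask indexes by the p-bp prefix of each mask.
// Masks are k-mers packed with 2 bits per base, so the table has 4^p slots.
func buildPrefixIndex(masks []uint64, k, p int) [][]int {
	table := make([][]int, 1<<(2*p))
	shift := uint(2 * (k - p))
	for i, m := range masks {
		prefix := m >> shift
		table[prefix] = append(table[prefix], i)
	}
	return table
}

func main() {
	// 4-mers ACGT, ACGA and TTTT, with A=00, C=01, G=10, T=11.
	masks := []uint64{0b00011011, 0b00011000, 0b11111111}
	table := buildPrefixIndex(masks, 4, 2)
	fmt.Println(len(table))    // 16 slots for p = 2
	fmt.Println(table[0b0001]) // [0 1]: the masks starting with AC
}
```

Each extra base in p multiplies the table size by 4: p = 7 needs 16384 slots, while p = 12 already needs 16777216.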

func (*LexicHash) IndexMasksWithDistinctPrefixes ¶ added in v0.5.0

func (lh *LexicHash) IndexMasksWithDistinctPrefixes(p int) error

IndexMasksWithDistinctPrefixes is similar to IndexMasks, but p must be chosen more carefully: it should be large enough that each prefix refers to only one mask, e.g., p == $(p in IndexMasks) + 1. Note that you also need to call IndexMasks.

func (*LexicHash) Mask ¶

func (lh *LexicHash) Mask(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

Mask computes the most similar substrings for each mask in sequence s. It returns

the list of the most similar k-mers for each mask. Note that k-mers of all A's or N's are skipped in k-mer generation step.
the start 0-based positions of all k-mers, with the last 1 bit as the strand flag (1 for negative strand).

skipRegions is optional and lists regions to skip. E.g., in the reference indexing step, contigs of a genome can be concatenated with k-1 N's, and these intervals need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...
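The contig-concatenation case can be sketched as follows. The helper concatContigs is hypothetical. Whether region ends are inclusive is not stated here, so the sketch assumes inclusive end coordinates; adjust if the package expects otherwise.

```go
package main

import (
	"bytes"
	"fmt"
)

// concatContigs joins contigs with k-1 N's and records each run of N's as a
// skip region. ASSUMPTION: a region is [start, end] with an inclusive end.
func concatContigs(contigs [][]byte, k int) ([]byte, [][2]int) {
	var seq []byte
	var regions [][2]int
	gap := bytes.Repeat([]byte{'N'}, k-1)
	for i, c := range contigs {
		if i > 0 {
			regions = append(regions, [2]int{len(seq), len(seq) + len(gap) - 1})
			seq = append(seq, gap...)
		}
		seq = append(seq, c...)
	}
	return seq, regions
}

func main() {
	contigs := [][]byte{[]byte("ACGTACGT"), []byte("TTGGCCAA")}
	seq, regions := concatContigs(contigs, 4)
	fmt.Println(string(seq)) // ACGTACGTNNNTTGGCCAA
	fmt.Println(regions)     // [[8 10]]
}
```

Because contigs are appended in order, the regions come out 0-based and already sorted in ascending order.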

func (*LexicHash) MaskKmer ¶ added in v0.4.0

func (lh *LexicHash) MaskKmer(kmer uint64) *[]int

MaskKmer returns the indexes of masks that possibly mask a k-mer. Running IndexMasks or IndexMasksWithDistinctPrefixes first is recommended. Don't forget to recycle the result via RecycleMaskKmerResult.

func (*LexicHash) MaskKnownDistinctPrefixes ¶ added in v0.5.0

func (lh *LexicHash) MaskKnownDistinctPrefixes(s []byte, skipRegions [][2]int, checkShorterPrefix bool) (*[]uint64, *[][]int, error)

MaskKnownDistinctPrefixes is similar to MaskKnownPrefixes, but you need to call both IndexMasks and IndexMasksWithDistinctPrefixes first. It is faster when the p'-bp prefixes (p' as given to IndexMasksWithDistinctPrefixes) are distinct. For safety, set checkShorterPrefix to true so that the shorter prefix (p as given to IndexMasks) is also checked. E.g., p=7 and p'=8 for 40000 masks.

E.g., given masks with the three 8-bp prefixes below, each prefix refers to one specific mask.

AAAACCCA
AAAACCCG
AAAACCCT

But a k-mer starting with AAAACCCC matches none of them, so we have to check all masks sharing the shorter prefix (AAAACCC). In some specific cases, such as filling sketching deserts, this further check is unnecessary.
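The two-level lookup described above can be sketched with plain strings and maps. The type and function names are illustrative, the package itself uses slices over bit-packed prefixes, and the sketch assumes the p'-bp prefixes really are distinct.

```go
package main

import "fmt"

// prefixIndex is a hypothetical two-level index over mask prefixes.
type prefixIndex struct {
	short    map[string][]int // p-bp prefix -> all masks sharing it
	distinct map[string]int   // p'-bp prefix -> the single mask having it
	p, pp    int
}

func newPrefixIndex(masks []string, p, pp int) *prefixIndex {
	idx := &prefixIndex{short: map[string][]int{}, distinct: map[string]int{}, p: p, pp: pp}
	for i, m := range masks {
		idx.short[m[:p]] = append(idx.short[m[:p]], i)
		idx.distinct[m[:pp]] = i
	}
	return idx
}

// candidates returns the mask indexes worth comparing against kmer.
func (idx *prefixIndex) candidates(kmer string, checkShorterPrefix bool) []int {
	if i, ok := idx.distinct[kmer[:idx.pp]]; ok {
		return []int{i} // fast path: exactly one mask has this prefix
	}
	if checkShorterPrefix {
		return idx.short[kmer[:idx.p]] // fall back to the shorter prefix
	}
	return nil
}

func main() {
	masks := []string{"AAAACCCAGG", "AAAACCCGTT", "AAAACCCTAC"}
	idx := newPrefixIndex(masks, 7, 8)
	fmt.Println(idx.candidates("AAAACCCGAA", true))  // [1]
	fmt.Println(idx.candidates("AAAACCCCAA", true))  // [0 1 2]
	fmt.Println(idx.candidates("AAAACCCCAA", false)) // []
}
```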

skipRegions is optional and lists regions to skip. E.g., in the reference indexing step, contigs of a genome can be concatenated with k-1 N's, and these intervals need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...

func (*LexicHash) MaskKnownPrefixes ¶ added in v0.4.0

func (lh *LexicHash) MaskKnownPrefixes(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

MaskKnownPrefixes masks only k-mers whose prefixes exist in the lookup table, so you need to run IndexMasks first.

It returns

the list of the most similar k-mers for each mask. Note that k-mers of all A's or N's are skipped in k-mer generation step.
the start 0-based positions of all k-mers, with the last 1 bit as the strand flag (1 for negative strand). Attention: it might be empty (len() == 0) if no k-mers are captured.

skipRegions is optional and lists regions to skip. E.g., in the reference indexing step, contigs of a genome can be concatenated with k-1 N's, and these intervals need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...

func (*LexicHash) MaskLongSeqs ¶ added in v0.2.0

func (lh *LexicHash) MaskLongSeqs(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

MaskLongSeqs is faster than Mask() for longer sequences because it uses longer, 5-bp prefixes to create the lookup table. It requires nMasks >= 1024.

func (*LexicHash) RecycleMaskKmerResult ¶ added in v0.4.0

func (lh *LexicHash) RecycleMaskKmerResult(list *[]int)

RecycleMaskKmerResult recycles the result of MaskKmer().

func (*LexicHash) RecycleMaskResult ¶

func (lh *LexicHash) RecycleMaskResult(kmers *[]uint64, locses *[][]int)

RecycleMaskResult recycles the results of Mask(). Please do not forget to call this method after using the mask results.

func (*LexicHash) SupportSoftMasking ¶ added in v0.5.1

func (lh *LexicHash) SupportSoftMasking() *LexicHash

SupportSoftMasking treats lowercase bases in soft-masked low-complexity regions as A's. It should be called before the Mask* methods.
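The effect of treating lowercase bases as A's can be sketched with a toy encoder. The function encode and its 2-bit encoding (A=00, C=01, G=10, T=11) are assumptions for illustration, not the package's code.

```go
package main

import "fmt"

// encode packs s into 2 bits per base (assumed A=00, C=01, G=10, T=11).
// With softMask set, every lowercase base is encoded as A.
// ok is false if s is longer than 32 bases or contains a non-ACGT byte.
func encode(s []byte, softMask bool) (code uint64, ok bool) {
	if len(s) > 32 {
		return 0, false
	}
	for _, b := range s {
		var v uint64
		switch b {
		case 'A', 'a':
			v = 0
		case 'C', 'c':
			v = 1
		case 'G', 'g':
			v = 2
		case 'T', 't':
			v = 3
		default:
			return 0, false
		}
		if softMask && b >= 'a' && b <= 'z' {
			v = 0 // soft-masked base: treat as A
		}
		code = code<<2 | v
	}
	return code, true
}

func main() {
	plain, _ := encode([]byte("ACgt"), false)
	masked, _ := encode([]byte("ACgt"), true)
	fmt.Printf("%08b\n", plain)  // 00011011 (ACGT)
	fmt.Printf("%08b\n", masked) // 00010000 (ACAA)
}
```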

func (*LexicHash) Write ¶

func (lh *LexicHash) Write(w io.Writer) (int, error)

Write writes a LexicHash.

Header (32 bytes):

Magic number, 8 bytes, kmermask
Main and minor versions, 2 bytes
K, 1 byte
Blank, 5 bytes
Seed: 8 bytes
Number of masks: 8 bytes

Data: k-mers.

K-mers as uint64 values, 8 * (number of masks) bytes
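The 32-byte header layout can be sketched as follows. The byte order is not stated in this documentation, so big-endian is an assumption here, and buildHeader is a hypothetical helper rather than the package's writer.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// buildHeader lays out the 32-byte header described above.
// ASSUMPTION: multi-byte integers are big-endian; the docs do not say.
func buildHeader(mainVer, minorVer, k uint8, seed int64, nMasks uint64) []byte {
	var buf bytes.Buffer
	buf.WriteString("kmermask")             // magic number, 8 bytes
	buf.Write([]byte{mainVer, minorVer, k}) // versions (2 bytes) and K (1 byte)
	buf.Write(make([]byte, 5))              // blank, 5 bytes
	order := binary.BigEndian
	// Writes to a bytes.Buffer do not fail, so the errors are ignored here.
	_ = binary.Write(&buf, order, seed)   // seed, 8 bytes
	_ = binary.Write(&buf, order, nMasks) // number of masks, 8 bytes
	return buf.Bytes()
}

func main() {
	h := buildHeader(0, 1, 31, 1, 1024)
	fmt.Println(len(h))             // 32
	fmt.Println(len(h) + 8*1024)    // 8224: total size with 1024 masks
}
```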

func (*LexicHash) WriteToFile ¶

func (lh *LexicHash) WriteToFile(file string) (int, error)

WriteToFile writes a LexicHash to a file, optionally compressed according to the file extension: .gz, .xz, .zst, or .bz2.

Directories ¶

Path	Synopsis
iterator