
lexichash

v0.5.3
Published: Jun 2, 2026 License: MIT Imports: 14 Imported by: 1

README

lexichash

This project implements LexicHash in Golang, with high performance and a low memory footprint.

  • This package is used in LexicMap.
  • Bit-packed k-mer operations are provided by kmers.

Support

Please open an issue to report bugs, propose new functions or ask for help.

License

MIT License

Documentation

Constants

This section is empty.

Variables

var ErrBrokenFile = errors.New("lexichash: broken file")

ErrBrokenFile means the file is not complete.

var ErrInsufficientMasks = errors.New("lexichash: insufficient masks (should be >=64)")

ErrInsufficientMasks means the number of masks is too small.

var ErrInvalidFileFormat = errors.New("lexichash: invalid binary format")

ErrInvalidFileFormat means the file is not in the expected binary format.

var ErrKOverflow = errors.New("lexichash: k-mer size overflow, valid range is [3-32]")

ErrKOverflow means that k is outside the valid range [3, 32].

var ErrPrefixOverflow = errors.New("lexichash: prefix should be in range of [3, k]")

ErrPrefixOverflow means that the prefix length is outside the valid range [3, k].

var ErrVersionMismatch = errors.New("lexichash: version mismatch")

ErrVersionMismatch means the version of the file does not match that of the program.

var Magic = [8]byte{'k', 'm', 'e', 'r', 'm', 'a', 's', 'k'}

var MainVersion uint8 = 0

var MinorVersion uint8 = 1

var Strands = [2]byte{'+', '-'}

Strands can be used to output the strand symbol for a reverse-complement flag: index 0 gives '+' and index 1 gives '-'.

Functions

func IsLowComplexity

func IsLowComplexity(code uint64, k int) bool

IsLowComplexity checks if a k-mer is of low-complexity.

func MustDecode

func MustDecode(code uint64, k uint8) []byte

MustDecode returns the k-mer as a byte slice.

func MustDecoder

func MustDecoder() func(code uint64, k uint8) []byte

MustDecoder returns a decode function that reuses its output byte slice between calls.
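As an illustration of what decoding a bit-packed k-mer involves, here is a minimal standalone sketch. It does not use this package; the 2-bit code (A=0, C=1, G=2, T=3, first base in the highest bits) is an assumption, and decodeInto and newDecoder are hypothetical names. The closure mirrors the idea of a decoder that reuses one buffer.

```go
package main

import "fmt"

// bases maps a 2-bit code to a nucleotide.
// The A=0, C=1, G=2, T=3 order is an assumption of this sketch.
var bases = [4]byte{'A', 'C', 'G', 'T'}

// decodeInto writes the k bases of code into buf, allocating only when
// buf is too small to hold k bytes.
func decodeInto(buf []byte, code uint64, k uint8) []byte {
	if cap(buf) < int(k) {
		buf = make([]byte, k)
	}
	buf = buf[:k]
	for i := int(k) - 1; i >= 0; i-- {
		buf[i] = bases[code&3] // the lowest 2 bits hold the last base
		code >>= 2
	}
	return buf
}

// newDecoder returns a closure that reuses a single buffer across calls,
// so the returned slice is only valid until the next call.
func newDecoder() func(code uint64, k uint8) []byte {
	var buf []byte
	return func(code uint64, k uint8) []byte {
		buf = decodeInto(buf, code, k)
		return buf
	}
}

func main() {
	decode := newDecoder()
	// ACGT -> 00 01 10 11 -> 27
	fmt.Printf("%s\n", decode(27, 4)) // ACGT
}
```

Because the buffer is shared, a caller that needs to keep a decoded k-mer should copy it before decoding the next one.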

Types

type LexicHash

type LexicHash struct {
	K int // max length of shared substrings, should be in range of [4, 31]

	Seed  int64    // seed for generating masks
	Masks []uint64 // masks/k-mers
	// contains filtered or unexported fields
}

LexicHash is for finding shared substrings between nucleotide sequences.

func New

func New(k int, nMasks int, p int) (*LexicHash, error)

New returns a new LexicHash object. nMasks should be >= 64, preferably >= 1024, and preferably a power of 4, e.g., 64, 256, 1024, 4096, ... p is the length of the mask k-mer prefixes that are checked for low complexity; set p to 0 to disable the check.
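The constraints on k and nMasks can be checked before calling a constructor. The sketch below is standalone and hypothetical (checkParams and isPowerOf4 are not part of this package); the ranges are taken from the error messages documented above.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the documented errors; these are not the package's own values.
var (
	errK     = errors.New("k-mer size out of range [3, 32]")
	errMasks = errors.New("insufficient masks (should be >= 64)")
)

// isPowerOf4 reports whether n equals 4^m for some m >= 0.
func isPowerOf4(n int) bool {
	if n < 1 {
		return false
	}
	for n%4 == 0 {
		n /= 4
	}
	return n == 1
}

// checkParams validates k and nMasks, and reports whether nMasks also
// follows the recommendation (>= 1024 and a power of 4).
func checkParams(k, nMasks int) (recommended bool, err error) {
	if k < 3 || k > 32 {
		return false, errK
	}
	if nMasks < 64 {
		return false, errMasks
	}
	return nMasks >= 1024 && isPowerOf4(nMasks), nil
}

func main() {
	fmt.Println(checkParams(31, 4096)) // true <nil>
	fmt.Println(checkParams(31, 2048)) // false <nil>
	fmt.Println(checkParams(31, 32))   // false insufficient masks (should be >= 64)
}
```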

func NewFromFile

func NewFromFile(file string) (*LexicHash, error)

NewFromFile creates a LexicHash from a binary file.

func NewFromTextFile (added in v0.3.0)

func NewFromTextFile(file string) (*LexicHash, error)

NewFromTextFile creates a new LexicHash object with custom k-mers read from a text file.

func NewWithMasks (added in v0.3.0)

func NewWithMasks(k int, masks []uint64) (*LexicHash, error)

NewWithMasks creates a new LexicHash object with custom k-mers. The number of masks should be >= 64, preferably >= 1024, and preferably a power of 4, e.g., 64, 256, 1024, 4096, ...

func NewWithSeed

func NewWithSeed(k int, nMasks int, randSeed int64, p int) (*LexicHash, error)

NewWithSeed creates a new LexicHash object with the given seed. nMasks should be >= 64, preferably >= 1024, and preferably a power of 4, e.g., 64, 256, 1024, 4096, ... p is the length of the mask k-mer prefixes that are checked for low complexity; set p to 0 to disable the check.

func Read

func Read(r io.Reader) (*LexicHash, error)

Read reads a LexicHash from an io.Reader.

func (*LexicHash) IndexMasks (added in v0.4.0)

func (lh *LexicHash) IndexMasks(p int) error

IndexMasks creates a lookup table (a slice) of the p-bp prefixes of the masks. Afterwards, you can use MaskKnownPrefixes() to mask only k-mers whose prefixes exist in the table. The prefix length p should not be too large, or the lookup table would occupy a lot of space.
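To see why a large p is costly, consider a slice-based table with one slot per possible p-bp prefix, i.e., 4^p slots. The following standalone sketch is an assumption about the general technique, not this package's implementation; buildPrefixIndex is a hypothetical name and masks are assumed to be 2-bit packed k-mers (A=0, C=1, G=2, T=3).

```go
package main

import "fmt"

// buildPrefixIndex groups mask indexes by the first p bases of each mask.
// Each mask is assumed to hold a k-mer in its low 2*k bits, and p <= k.
func buildPrefixIndex(masks []uint64, k, p int) [][]int {
	table := make([][]int, 1<<uint(2*p)) // 4^p slots, one per possible prefix
	shift := uint(2 * (k - p))
	for i, m := range masks {
		prefix := m >> shift
		table[prefix] = append(table[prefix], i)
	}
	return table
}

func main() {
	// Three 4-mers: ACGT (27), ACTT (31), GGGG (170).
	masks := []uint64{27, 31, 170}
	table := buildPrefixIndex(masks, 4, 2)
	fmt.Println(len(table)) // 16
	fmt.Println(table[1])   // prefix AC: [0 1]
	fmt.Println(table[10])  // prefix GG: [2]
}
```

In this layout, every additional base in the prefix multiplies the number of slots by 4, so p = 12 already needs 16,777,216 slots.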

func (*LexicHash) IndexMasksWithDistinctPrefixes (added in v0.5.0)

func (lh *LexicHash) IndexMasksWithDistinctPrefixes(p int) error

IndexMasksWithDistinctPrefixes is similar to IndexMasks, but p should be larger so that each prefix refers to only one mask, e.g., p == $(p in IndexMasks) + 1. Note that you also need to call IndexMasks.

func (*LexicHash) Mask

func (lh *LexicHash) Mask(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

Mask computes the most similar substrings for each mask in sequence s. It returns:

  1. the list of the most similar k-mers for each mask. Note that k-mers of all A's or N's are skipped in the k-mer generation step.
  2. the 0-based start positions of all k-mers, with the last bit as the strand flag (1 for the negative strand).
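Since each returned location carries the strand in its last bit, callers have to split the value before using it. A minimal standalone sketch follows; splitLoc and packLoc are hypothetical helpers, and the local strands array mirrors the Strands variable documented above.

```go
package main

import "fmt"

// strands mirrors the documented Strands variable.
var strands = [2]byte{'+', '-'}

// splitLoc separates a packed location into the 0-based start position
// and the strand flag (1 for the negative strand).
func splitLoc(loc int) (pos int, rc int) {
	return loc >> 1, loc & 1
}

// packLoc is the inverse of splitLoc.
func packLoc(pos, rc int) int {
	return pos<<1 | rc
}

func main() {
	loc := packLoc(100, 1)
	pos, rc := splitLoc(loc)
	fmt.Printf("%d %c\n", pos, strands[rc]) // 100 -
}
```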

skipRegions is optional and is used to skip masked regions. E.g., in the reference indexing step, the contigs of a genome can be concatenated with k-1 N's, and these spacer regions need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...
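One way to build such a list is to record the spacer coordinates while concatenating the contigs. The sketch below is standalone and hypothetical (concatContigs is not part of this package). The documentation does not say whether the end coordinate is inclusive; this sketch assumes it is, so check the behaviour before relying on it.

```go
package main

import (
	"bytes"
	"fmt"
)

// concatContigs joins contigs with k-1 N's and returns the joined sequence
// together with the 0-based regions covered by the spacers, in ascending order.
// Treating the end coordinate as inclusive is an assumption of this sketch.
func concatContigs(contigs [][]byte, k int) ([]byte, [][2]int) {
	spacer := bytes.Repeat([]byte{'N'}, k-1)
	var seq []byte
	var regions [][2]int
	for i, c := range contigs {
		if i > 0 {
			start := len(seq)
			seq = append(seq, spacer...)
			regions = append(regions, [2]int{start, len(seq) - 1})
		}
		seq = append(seq, c...)
	}
	return seq, regions
}

func main() {
	contigs := [][]byte{[]byte("ACGTACGT"), []byte("TTGCA"), []byte("GGC")}
	seq, regions := concatContigs(contigs, 4)
	fmt.Printf("%s\n", seq) // ACGTACGTNNNTTGCANNNGGC
	fmt.Println(regions)    // [[8 10] [16 18]]
}
```

The joined sequence and the region list would then be passed together as s and skipRegions.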

func (*LexicHash) MaskKmer (added in v0.4.0)

func (lh *LexicHash) MaskKmer(kmer uint64) *[]int

MaskKmer returns the indexes of masks that possibly mask a k-mer. It is recommended to call IndexMasks or IndexMasksWithDistinctPrefixes first. Don't forget to recycle the result via RecycleMaskKmerResult.

func (*LexicHash) MaskKnownDistinctPrefixes (added in v0.5.0)

func (lh *LexicHash) MaskKnownDistinctPrefixes(s []byte, skipRegions [][2]int, checkShorterPrefix bool) (*[]uint64, *[][]int, error)

MaskKnownDistinctPrefixes is similar to MaskKnownPrefixes, but you need to call both IndexMasks and IndexMasksWithDistinctPrefixes first. It is faster when the p'-bp prefixes (p' as in IndexMasksWithDistinctPrefixes) are distinct. For safety, set checkShorterPrefix to true so that the shorter prefix (p as in IndexMasks) is also checked. E.g., p = 7 and p' = 8 for 40000 masks.

E.g., suppose the masks contain the three 8-bp prefixes below; each of them refers to one specific mask.

AAAACCCA
AAAACCCG
AAAACCCT

A k-mer starting with AAAACCCC, however, matches none of them, so all masks with the shorter prefix (AAAACCC) have to be checked. In some specific cases, such as filling sketching deserts, no further check is needed.
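The two-level lookup described above can be sketched as follows. This is a standalone illustration under stated assumptions, not this package's implementation: prefixIndex, buildIndex, lookup, and encode are hypothetical, maps replace the slice-based tables, and k-mers are assumed to be 2-bit packed (A=0, C=1, G=2, T=3).

```go
package main

import "fmt"

// encode packs a DNA string into 2 bits per base (assumed A=0, C=1, G=2, T=3).
func encode(s string) uint64 {
	var code uint64
	for i := 0; i < len(s); i++ {
		code <<= 2
		switch s[i] {
		case 'C':
			code |= 1
		case 'G':
			code |= 2
		case 'T':
			code |= 3
		}
	}
	return code
}

// prefixIndex is a hypothetical two-level index over the masks.
type prefixIndex struct {
	k, p, pLong int
	short       map[uint64][]int // p-bp prefix -> all masks sharing it
	long        map[uint64]int   // p'-bp prefix -> the single mask owning it
}

// buildIndex fails if two masks share the same pLong-bp prefix.
func buildIndex(masks []uint64, k, p, pLong int) (*prefixIndex, error) {
	idx := &prefixIndex{k: k, p: p, pLong: pLong,
		short: map[uint64][]int{}, long: map[uint64]int{}}
	for i, m := range masks {
		s := m >> uint(2*(k-p))
		idx.short[s] = append(idx.short[s], i)
		l := m >> uint(2*(k-pLong))
		if _, dup := idx.long[l]; dup {
			return nil, fmt.Errorf("%d-bp prefixes are not distinct", pLong)
		}
		idx.long[l] = i
	}
	return idx, nil
}

// lookup tries the long prefix first and optionally falls back to all
// masks sharing the shorter prefix.
func (idx *prefixIndex) lookup(kmer uint64, checkShorterPrefix bool) []int {
	if i, ok := idx.long[kmer>>uint(2*(idx.k-idx.pLong))]; ok {
		return []int{i}
	}
	if checkShorterPrefix {
		return idx.short[kmer>>uint(2*(idx.k-idx.p))]
	}
	return nil
}

func main() {
	masks := []uint64{encode("AAAACCCAGT"), encode("AAAACCCGTA"), encode("AAAACCCTCC")}
	idx, err := buildIndex(masks, 10, 7, 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(idx.lookup(encode("AAAACCCGAA"), true))  // [1]
	fmt.Println(idx.lookup(encode("AAAACCCCAA"), true))  // [0 1 2]
	fmt.Println(idx.lookup(encode("AAAACCCCAA"), false)) // []
}
```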

skipRegions is optional and is used to skip masked regions. E.g., in the reference indexing step, the contigs of a genome can be concatenated with k-1 N's, and these spacer regions need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...

func (*LexicHash) MaskKnownPrefixes (added in v0.4.0)

func (lh *LexicHash) MaskKnownPrefixes(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

MaskKnownPrefixes masks only k-mers whose prefixes exist in the masks, so you need to call IndexMasks first.

It returns:

  1. the list of the most similar k-mers for each mask. Note that k-mers of all A's or N's are skipped in the k-mer generation step.
  2. the 0-based start positions of all k-mers, with the last bit as the strand flag (1 for the negative strand). Attention: it might be empty (len() == 0) if no k-mers are captured.

skipRegions is optional and is used to skip masked regions. E.g., in the reference indexing step, the contigs of a genome can be concatenated with k-1 N's, and these spacer regions need to be omitted.

The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...

func (*LexicHash) MaskLongSeqs (added in v0.2.0)

func (lh *LexicHash) MaskLongSeqs(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)

MaskLongSeqs is faster than Mask() for longer sequences because it uses longer, 5-bp prefixes to build the lookup table; it requires nMasks >= 1024.

func (*LexicHash) RecycleMaskKmerResult (added in v0.4.0)

func (lh *LexicHash) RecycleMaskKmerResult(list *[]int)

RecycleMaskKmerResult recycles the result of MaskKmer().

func (*LexicHash) RecycleMaskResult

func (lh *LexicHash) RecycleMaskResult(kmers *[]uint64, locses *[][]int)

RecycleMaskResult recycles the results of Mask(). Please do not forget to call this method after using the mask results.
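The pointer-typed results and the Recycle* methods suggest that result slices are reused between calls. How this package does that internally is not documented here; the standalone sketch below only shows the general pattern with sync.Pool, using hypothetical getResult and recycle helpers.

```go
package main

import (
	"fmt"
	"sync"
)

// pool hands out reusable result slices (an assumed design, for illustration).
var pool = sync.Pool{New: func() interface{} {
	s := make([]uint64, 0, 1024)
	return &s
}}

// getResult returns a pointer to an empty slice; its backing array may
// come from an earlier, recycled result.
func getResult() *[]uint64 {
	s := pool.Get().(*[]uint64)
	*s = (*s)[:0]
	return s
}

// recycle gives the slice back to the pool. The caller must not use it afterwards.
func recycle(s *[]uint64) {
	pool.Put(s)
}

func main() {
	for round := 0; round < 3; round++ {
		res := getResult()
		*res = append(*res, uint64(round))
		fmt.Println(len(*res), cap(*res) >= 1024) // 1 true
		recycle(res)
	}
}
```

With this pattern, forgetting to recycle does not break correctness but gives up the saved allocations, and using a result after recycling it risks seeing data from a later call.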

func (*LexicHash) SupportSoftMasking (added in v0.5.1)

func (lh *LexicHash) SupportSoftMasking() *LexicHash

SupportSoftMasking treats lowercase bases in soft-masked low-complexity regions as A's. It should be called before the Mask* methods.
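The effect on the input can be pictured with a small standalone sketch; softMaskToA is a hypothetical helper that applies the substitution explicitly, not something this package exposes. Since the Mask documentation states that k-mers of all A's are skipped, runs of soft-masked bases of at least k bases yield k-mers that are skipped.

```go
package main

import "fmt"

// softMaskToA returns a copy of s in which every lowercase letter is replaced by 'A'.
func softMaskToA(s []byte) []byte {
	out := make([]byte, len(s))
	for i, b := range s {
		if b >= 'a' && b <= 'z' {
			out[i] = 'A'
		} else {
			out[i] = b
		}
	}
	return out
}

func main() {
	fmt.Printf("%s\n", softMaskToA([]byte("ACGTacgtacGGTN"))) // ACGTAAAAAAGGTN
}
```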

func (*LexicHash) Write

func (lh *LexicHash) Write(w io.Writer) (int, error)

Write writes a LexicHash.

Header (32 bytes):

Magic number, 8 bytes, kmermask
Main and minor versions, 2 bytes
K, 1 byte
Blank, 5 bytes
Seed: 8 bytes
Number of masks: 8 bytes

Data: k-mers.

K-mers in uint64, 8*$(the number of masks) bytes
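The 32-byte header layout above can be sketched with encoding/binary. This is a standalone illustration, not this package's code: encodeHeader and decodeHeader are hypothetical, and big-endian byte order is an assumption because the layout does not state it.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

var magic = [8]byte{'k', 'm', 'e', 'r', 'm', 'a', 's', 'k'}

// encodeHeader lays out the 32-byte header.
// Big-endian byte order is an assumption of this sketch.
func encodeHeader(mainVer, minorVer, k uint8, seed int64, nMasks uint64) []byte {
	h := make([]byte, 32)
	copy(h[0:8], magic[:])  // magic number, 8 bytes
	h[8], h[9] = mainVer, minorVer // versions, 2 bytes
	h[10] = k               // K, 1 byte
	// h[11:16] stays zero: 5 blank bytes
	binary.BigEndian.PutUint64(h[16:24], uint64(seed))
	binary.BigEndian.PutUint64(h[24:32], nMasks)
	return h
}

// decodeHeader checks the length and magic number, then extracts the fields.
func decodeHeader(h []byte) (k uint8, seed int64, nMasks uint64, err error) {
	if len(h) < 32 {
		return 0, 0, 0, errors.New("broken header")
	}
	if !bytes.Equal(h[0:8], magic[:]) {
		return 0, 0, 0, errors.New("invalid binary format")
	}
	seed = int64(binary.BigEndian.Uint64(h[16:24]))
	nMasks = binary.BigEndian.Uint64(h[24:32])
	return h[10], seed, nMasks, nil
}

func main() {
	h := encodeHeader(0, 1, 31, 1, 1024)
	k, seed, n, err := decodeHeader(h)
	fmt.Println(len(h), k, seed, n, err) // 32 31 1 1024 <nil>
}
```

After the header, the data section would hold one 8-byte value per mask, so a file with 1024 masks would occupy 32 + 8*1024 bytes before any compression.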

func (*LexicHash) WriteToFile

func (lh *LexicHash) WriteToFile(file string) (int, error)

WriteToFile writes a LexicHash to a file, optionally with a file extension of .gz, .xz, .zst, or .bz2.
