Documentation ¶
Index ¶
- Variables
- func Hash64(key uint64) uint64
- func IsLowComplexity(code uint64, k int) bool
- func MustDecode(code uint64, k uint8) []byte
- func MustDecoder() func(code uint64, k uint8) []byte
- type LexicHash
- func New(k int, nMasks int, p int) (*LexicHash, error)
- func NewFromFile(file string) (*LexicHash, error)
- func NewFromTextFile(file string) (*LexicHash, error)
- func NewWithMasks(k int, masks []uint64) (*LexicHash, error)
- func NewWithSeed(k int, nMasks int, randSeed int64, p int) (*LexicHash, error)
- func Read(r io.Reader) (*LexicHash, error)
- func (lh *LexicHash) IndexMasks(p int) error
- func (lh *LexicHash) IndexMasksWithDistinctPrefixes(p int) error
- func (lh *LexicHash) Mask(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)
- func (lh *LexicHash) MaskKmer(kmer uint64) *[]int
- func (lh *LexicHash) MaskKnownDistinctPrefixes(s []byte, skipRegions [][2]int, checkShorterPrefix bool) (*[]uint64, *[][]int, error)
- func (lh *LexicHash) MaskKnownPrefixes(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)
- func (lh *LexicHash) MaskLongSeqs(s []byte, skipRegions [][2]int) (*[]uint64, *[][]int, error)
- func (lh *LexicHash) RecycleMaskKmerResult(list *[]int)
- func (lh *LexicHash) RecycleMaskResult(kmers *[]uint64, locses *[][]int)
- func (lh *LexicHash) SupportSoftMasking() *LexicHash
- func (lh *LexicHash) Write(w io.Writer) (int, error)
- func (lh *LexicHash) WriteToFile(file string) (int, error)
Constants ¶
This section is empty.
Variables ¶
var ErrBrokenFile = errors.New("lexichash: broken file")
ErrBrokenFile means the file is not complete.
var ErrInsufficientMasks = errors.New("lexichash: insufficient masks (should be >=64)")
ErrInsufficientMasks means the number of masks is too small.
var ErrInvalidFileFormat = errors.New("lexichash: invalid binary format")
ErrInvalidFileFormat means invalid file format.
var ErrKOverflow = errors.New("lexichash: k-mer size overflow, valid range is [3-32]")
ErrKOverflow means K > 32.
var ErrPrefixOverflow = errors.New("lexichash: prefix should be in range of [3, k]")
ErrPrefixOverflow means prefix > k.
var ErrVersionMismatch = errors.New("lexichash: version mismatch")
ErrVersionMismatch means version mismatch between files and program.
var Magic = [8]byte{'k', 'm', 'e', 'r', 'm', 'a', 's', 'k'}
var MainVersion uint8 = 0
var MinorVersion uint8 = 1
var Strands = [2]byte{'+', '-'}
Strands can be used to print the strand symbol for a reverse-complement flag (0 for '+', 1 for '-').
Functions ¶
func Hash64 ¶ added in v0.2.0
Hash64 hashes a 64-bit integer. See https://gist.github.com/badboy/6267743 . Version with mask: https://gist.github.com/lh3/974ced188be2f90422cc .
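The linked gists describe Thomas Wang's 64-bit integer hash. The following is a minimal sketch of that mixing function, written from the gists and not copied from this package; the name wangHash64 is hypothetical and the package's Hash64 may differ in detail.

```go
package main

import "fmt"

// wangHash64 sketches the 64-bit integer hash from the linked gists
// (hypothetical name; not taken from this package's source).
// Every step is invertible, so distinct inputs give distinct outputs.
func wangHash64(key uint64) uint64 {
	key = ^key + (key << 21) // key = (key << 21) - key - 1
	key ^= key >> 24
	key = key + (key << 3) + (key << 8) // key * 265
	key ^= key >> 14
	key = key + (key << 2) + (key << 4) // key * 21
	key ^= key >> 28
	key += key << 31
	return key
}

func main() {
	for _, code := range []uint64{0, 1, 2, 3} {
		fmt.Printf("%d -> %#016x\n", code, wangHash64(code))
	}
}
```

Because each step is a bijection on uint64, such a hash can scramble k-mer codes without introducing collisions.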
func IsLowComplexity ¶
IsLowComplexity checks if a k-mer is of low-complexity.
func MustDecoder ¶
MustDecoder returns a Decode function that reuses the same byte slice across calls, so copy the result if it must outlive the next call.
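To illustrate what reusing the byte slice implies, here is a small sketch of a decoder closure. It assumes the common 2-bit encoding (A=0, C=1, G=2, T=3, first base in the highest bits), which this page does not state; newDecoder is a hypothetical name.

```go
package main

import "fmt"

// newDecoder returns a decode function backed by one shared buffer.
// Assumption: 2 bits per base, A=0, C=1, G=2, T=3, first base in the high bits.
func newDecoder() func(code uint64, k uint8) []byte {
	buf := make([]byte, 32)
	const bases = "ACGT"
	return func(code uint64, k uint8) []byte {
		if k == 0 || k > 32 {
			panic("k out of range")
		}
		kmer := buf[:k]
		// Fill from the last base backwards, consuming 2 bits per base.
		for i := int(k) - 1; i >= 0; i-- {
			kmer[i] = bases[code&3]
			code >>= 2
		}
		return kmer
	}
}

func main() {
	decode := newDecoder()
	first := decode(0b00011011, 4)
	kept := string(first) // copy before the next call overwrites the buffer
	second := decode(0b11100100, 4)
	// first and second share the buffer, so both now read the second k-mer.
	fmt.Printf("kept=%s first=%s second=%s\n", kept, first, second)
}
```

The trade-off is fewer allocations in hot loops at the cost of aliasing between successive results.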
Types ¶
type LexicHash ¶
type LexicHash struct {
K int // max length of shared substrings, should be in range of [4, 31]
Seed int64 // seed for generating masks
Masks []uint64 // masks/k-mers
// contains filtered or unexported fields
}
LexicHash is for finding shared substrings between nucleotide sequences.
func New ¶
New returns a new LexicHash object. nMasks should be >= 64, preferably >= 1024, and ideally a power of 4, i.e., 64, 256, 1024, 4096 ... p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.
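The constraints on nMasks can be checked before constructing an object. The sketch below uses hypothetical helpers (isPowerOf4, checkNMasks) that are not part of this package; the three labels are only an illustration of the rule stated above.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isPowerOf4 reports whether n is 4^i for some i >= 0:
// exactly one bit set, at an even bit position.
func isPowerOf4(n int) bool {
	return n > 0 && n&(n-1) == 0 && bits.TrailingZeros64(uint64(n))%2 == 0
}

// checkNMasks classifies a mask count (hypothetical helper, not in the package).
func checkNMasks(n int) string {
	switch {
	case n < 64:
		return "too few"
	case n >= 1024 && isPowerOf4(n):
		return "recommended"
	default:
		return "allowed"
	}
}

func main() {
	for _, n := range []int{32, 64, 1000, 1024, 2048, 4096} {
		fmt.Printf("nMasks=%d: %s\n", n, checkNMasks(n))
	}
}
```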
func NewFromFile ¶
NewFromFile creates a LexicHash from a binary file.
func NewFromTextFile ¶ added in v0.3.0
NewFromTextFile creates a new LexicHash object from custom k-mers listed in a text file.
func NewWithMasks ¶ added in v0.3.0
NewWithMasks creates a new LexicHash object with custom k-mers. The number of masks should be >= 64, preferably >= 1024, and ideally a power of 4, i.e., 64, 256, 1024, 4096 ...
func NewWithSeed ¶
NewWithSeed creates a new LexicHash object with the given seed. nMasks should be >= 64, preferably >= 1024, and ideally a power of 4, i.e., 64, 256, 1024, 4096 ... p is the length of the mask k-mer prefixes that are checked for low complexity; p == 0 disables the check.
func (*LexicHash) IndexMasks ¶ added in v0.4.0
IndexMasks creates a lookup table (a slice) of the p-bp prefixes; you can then use MaskKnownPrefixes() to mask only k-mers whose prefixes exist in the table. The prefix length p can't be too big, or the lookup table would occupy a lot of space.
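The space concern follows from the table having one slot per possible p-bp prefix, i.e., 4^p slots. A minimal sketch of such a table, assuming 2-bit encoded k-mers where the prefix is the top 2*p bits of the code (buildPrefixTable is a hypothetical name, and the package's internal layout may differ):

```go
package main

import "fmt"

// buildPrefixTable groups mask indexes by their p-bp prefix.
// Assumption: masks are 2-bit encoded k-mers (A=0, C=1, G=2, T=3),
// so the prefix is the top 2*p bits of the 2*k-bit code.
func buildPrefixTable(masks []uint64, k, p int) [][]int {
	table := make([][]int, 1<<uint(2*p)) // 4^p slots
	shift := uint(2 * (k - p))
	for i, m := range masks {
		prefix := m >> shift
		table[prefix] = append(table[prefix], i)
	}
	return table
}

func main() {
	// k=4, p=2; masks ACGT, ACTT, TGCA under the assumed encoding.
	masks := []uint64{0b00011011, 0b00011111, 0b11100100}
	table := buildPrefixTable(masks, 4, 2)
	fmt.Println("slots:", len(table))
	fmt.Println("AC ->", table[0b0001])
	fmt.Println("TG ->", table[0b1110])

	// The slot count grows by a factor of 4 for each extra prefix base.
	for _, p := range []int{5, 7, 10} {
		tableSize := 1 << uint(2*p)
		fmt.Printf("p=%d -> %d slots\n", p, tableSize)
	}
}
```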
func (*LexicHash) IndexMasksWithDistinctPrefixes ¶ added in v0.5.0
IndexMasksWithDistinctPrefixes is similar to IndexMasks, but choosing p is trickier: it should be large enough that each prefix refers to only one mask, e.g., p == (p in IndexMasks) + 1. Note that you also need to call IndexMasks.
func (*LexicHash) Mask ¶
Mask computes the most similar substrings for each mask in sequence s. It returns
- the list of the most similar k-mers for each mask. Note that k-mers of all A's or N's are skipped in k-mer generation step.
- the start 0-based positions of all k-mers, with the last 1 bit as the strand flag (1 for negative strand).
skipRegions is optional and is used to skip some masked regions. E.g., in the reference indexing step, the contigs of a genome can be concatenated with k-1 N's, and these intervals need to be omitted.
The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...
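Since each returned location carries the strand in its lowest bit, a caller has to split the value before using it. A minimal sketch, assuming the position occupies the remaining higher bits; splitLoc is a hypothetical helper and strands mirrors the package's Strands variable.

```go
package main

import "fmt"

// strands mirrors the package-level Strands variable.
var strands = [2]byte{'+', '-'}

// splitLoc separates a location value into its 0-based start position
// and strand symbol (hypothetical helper, not part of the package).
func splitLoc(loc int) (pos int, strand byte) {
	return loc >> 1, strands[loc&1]
}

func main() {
	for _, loc := range []int{0, 201, 2468} {
		pos, strand := splitLoc(loc)
		fmt.Printf("loc=%d pos=%d strand=%c\n", loc, pos, strand)
	}
}
```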
func (*LexicHash) MaskKmer ¶ added in v0.4.0
MaskKmer returns the indexes of masks that possibly mask a k-mer. Running IndexMasks or IndexMasksWithDistinctPrefixes first is recommended. Don't forget to recycle the result via RecycleMaskKmerResult.
func (*LexicHash) MaskKnownDistinctPrefixes ¶ added in v0.5.0
func (lh *LexicHash) MaskKnownDistinctPrefixes(s []byte, skipRegions [][2]int, checkShorterPrefix bool) (*[]uint64, *[][]int, error)
MaskKnownDistinctPrefixes is similar to MaskKnownPrefixes, but it requires calling both IndexMasks and IndexMasksWithDistinctPrefixes first. It is faster when the prefixes of length p' (set in IndexMasksWithDistinctPrefixes) are distinct. For safety, set checkShorterPrefix to true so that the shorter prefix (p, set in IndexMasks) is also checked. E.g., p=7 and p'=8 for 40000 masks.
E.g., suppose the masks have the three 8-bp prefixes below; each of them refers to one specific mask.
AAAACCCA AAAACCCG AAAACCCT
But AAAACCCC matches none of them, so all masks with the shorter prefix (AAAACCC) have to be checked. In some specific cases, such as filling sketching deserts, no further check is needed.
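The lookup order described above can be sketched as follows. This is only an illustration: it uses string keys and maps for readability, whereas the package indexes encoded k-mers in slices, and the type and function names are hypothetical.

```go
package main

import "fmt"

// prefixIndex holds two lookups: p-bp prefixes shared by several masks,
// and (p+1)-bp prefixes assumed to identify a single mask each.
type prefixIndex struct {
	p        int
	short    map[string][]int
	distinct map[string]int
}

func newPrefixIndex(masks []string, p int) *prefixIndex {
	idx := &prefixIndex{p: p, short: map[string][]int{}, distinct: map[string]int{}}
	for i, m := range masks {
		idx.short[m[:p]] = append(idx.short[m[:p]], i)
		idx.distinct[m[:p+1]] = i // assumes the (p+1)-bp prefixes are unique
	}
	return idx
}

// candidates returns the mask indexes worth comparing against kmer.
func (idx *prefixIndex) candidates(kmer string, checkShorterPrefix bool) []int {
	if i, ok := idx.distinct[kmer[:idx.p+1]]; ok {
		return []int{i} // fast path: exactly one mask
	}
	if checkShorterPrefix {
		return idx.short[kmer[:idx.p]] // fall back to all masks sharing the p-bp prefix
	}
	return nil
}

func main() {
	masks := []string{"AAAACCCAGT", "AAAACCCGTA", "AAAACCCTAC"}
	idx := newPrefixIndex(masks, 7)
	fmt.Println(idx.candidates("AAAACCCGGG", true))  // known 8-bp prefix
	fmt.Println(idx.candidates("AAAACCCCAT", true))  // unknown 8-bp prefix, fallback
	fmt.Println(idx.candidates("AAAACCCCAT", false)) // unknown 8-bp prefix, no fallback
}
```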
skipRegions is optional and is used to skip some masked regions. E.g., in the reference indexing step, the contigs of a genome can be concatenated with k-1 N's, and these intervals need to be omitted.
The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...
func (*LexicHash) MaskKnownPrefixes ¶ added in v0.4.0
MaskKnownPrefixes masks only k-mers whose prefixes exist in the lookup table, so you need to run IndexMasks first.
It returns
- the list of the most similar k-mers for each mask. Note that k-mers of all A's or N's are skipped in k-mer generation step.
- the start 0-based positions of all k-mers, with the last 1 bit as the strand flag (1 for negative strand). Attention: it might be empty (len() == 0) if no k-mers are captured.
skipRegions is optional and is used to skip some masked regions. E.g., in the reference indexing step, the contigs of a genome can be concatenated with k-1 N's, and these intervals need to be omitted.
The regions should be 0-based and sorted in ascending order, e.g., [100, 130], [200, 230] ...
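As a sketch of how skipRegions might be produced in the contig-concatenation scenario, the helper below joins contigs with k-1 N's and records where the N's sit. It assumes the regions are inclusive [start, end] intervals, which this page does not specify; concatContigs is a hypothetical name.

```go
package main

import (
	"bytes"
	"fmt"
)

// concatContigs joins contigs with k-1 N's and returns the joined sequence
// plus the 0-based regions covering the N's, in ascending order.
// Assumption: regions are inclusive [start, end]; k is at least 2.
func concatContigs(contigs [][]byte, k int) ([]byte, [][2]int) {
	gap := bytes.Repeat([]byte{'N'}, k-1)
	var seq []byte
	var skip [][2]int
	for i, c := range contigs {
		if i > 0 {
			skip = append(skip, [2]int{len(seq), len(seq) + len(gap) - 1})
			seq = append(seq, gap...)
		}
		seq = append(seq, c...)
	}
	return seq, skip
}

func main() {
	contigs := [][]byte{[]byte("ACGTACGT"), []byte("TTGGCCAA"), []byte("GATTACA")}
	seq, skip := concatContigs(contigs, 4)
	fmt.Printf("%s\n", seq)
	fmt.Println(skip)
}
```

The resulting sequence and regions would then be passed together as the s and skipRegions arguments of a Mask* method.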
func (*LexicHash) MaskLongSeqs ¶ added in v0.2.0
MaskLongSeqs is faster than Mask() for longer sequences because it builds the lookup table from longer, 5-bp prefixes. It requires nMasks >= 1024.
func (*LexicHash) RecycleMaskKmerResult ¶ added in v0.4.0
RecycleMaskKmerResult recycles the result of MaskKmer().
func (*LexicHash) RecycleMaskResult ¶
RecycleMaskResult recycles the results of Mask(). Please do not forget to call this method after using the mask results.
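Returning pointers to slices and asking the caller to recycle them is a common way to reduce allocations. This page does not say how the package pools its results; the sketch below assumes a sync.Pool and uses hypothetical names (listPool, getList, recycleList) purely to show the get/use/recycle pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// listPool hands out reusable *[]int values (an assumed design,
// not taken from the package's source).
var listPool = sync.Pool{New: func() interface{} {
	list := make([]int, 0, 8)
	return &list
}}

// getList returns an empty list that may reuse earlier backing storage.
func getList() *[]int {
	list := listPool.Get().(*[]int)
	*list = (*list)[:0]
	return list
}

// recycleList returns a list to the pool; it must not be used afterwards.
func recycleList(list *[]int) {
	if list != nil {
		listPool.Put(list)
	}
}

func main() {
	list := getList()
	*list = append(*list, 3, 17, 42)
	fmt.Println(*list)
	recycleList(list) // done with the result, hand it back
}
```

Forgetting to recycle is not a correctness bug in this pattern, but it gives up the allocation savings.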
func (*LexicHash) SupportSoftMasking ¶ added in v0.5.1
SupportSoftMasking treats lowercase bases in soft-masked low-complexity regions as A's. It should be called before any Mask* method.
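The effect of treating lowercase bases as A's can be illustrated with a small encoder. This is a sketch under the assumption of 2-bit encoding (A=0, C=1, G=2, T=3); encodeBase and encodeKmer are hypothetical and do not reflect the package's internals.

```go
package main

import "fmt"

// encodeBase maps a base to its 2-bit code (assumed A=0, C=1, G=2, T=3).
// With softMasking enabled, lowercase a/c/g/t are all encoded as A.
func encodeBase(b byte, softMasking bool) (uint64, bool) {
	switch b {
	case 'A':
		return 0, true
	case 'C':
		return 1, true
	case 'G':
		return 2, true
	case 'T':
		return 3, true
	case 'a', 'c', 'g', 't':
		if softMasking {
			return 0, true
		}
		return encodeBase(b-'a'+'A', false)
	}
	return 0, false // N and other symbols are not encodable
}

// encodeKmer packs a k-mer into a uint64, first base in the highest bits.
func encodeKmer(s string, softMasking bool) (uint64, bool) {
	var code uint64
	for i := 0; i < len(s); i++ {
		c, ok := encodeBase(s[i], softMasking)
		if !ok {
			return 0, false
		}
		code = code<<2 | c
	}
	return code, true
}

func main() {
	plain, _ := encodeKmer("ACgt", false)
	masked, _ := encodeKmer("ACgt", true)
	allLower, _ := encodeKmer("acgt", true)
	fmt.Printf("plain=%08b masked=%08b allLower=%08b\n", plain, masked, allLower)
}
```

Under this encoding a fully soft-masked window becomes a k-mer of all A's, which the Mask notes say is skipped during k-mer generation.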