Documentation ¶
Index ¶
- func IsAnyAlphabet(text string) bool
- func IsInArray(text string, arr []string) bool
- func IsLower(text string) bool
- func IsNumber(text string) bool
- func NonBreakingPrefixesLoader(lang string) (result []string)
- func PerlPropsLoader(ext string) string
- func RemoveEmptyStringFromSlice(texts []string) (result []string)
- type Normalizer
- type Replacement
- type Tokenizer
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func IsAnyAlphabet ¶
IsAnyAlphabet checks whether at least one alphabet character exists in the string
func IsLower ¶
IsLower checks whether a string consists entirely of lowercase characters
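A minimal usage sketch of the two checks above; the package alias "moses" and its import path are placeholders for this package's real import path.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path, replace with the real one
	)

	func main() {
		fmt.Println(moses.IsAnyAlphabet("123a456")) // true: contains at least one alphabet character
		fmt.Println(moses.IsAnyAlphabet("12345"))   // false: digits only
		fmt.Println(moses.IsLower("hello"))         // true: all characters are lowercase
		fmt.Println(moses.IsLower("Hello"))         // false: contains an uppercase character
	}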
func NonBreakingPrefixesLoader ¶
NonBreakingPrefixesLoader reads the list of non-breaking prefixes (e.g. abbreviations followed by a period) for the given language, following the nonbreaking_prefixes lists shipped with Moses (https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes)
func PerlPropsLoader ¶
PerlPropsLoader reads lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip archive are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm
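A hedged sketch of the two loaders above. The argument values ("en", "IsAlnum") are illustrative guesses based on the Moses and perluniprops conventions, not confirmed by these docs, and the import path is a placeholder.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// Load the non-breaking prefix list; "en" is an assumed language code.
		prefixes := moses.NonBreakingPrefixesLoader("en")
		fmt.Println(len(prefixes))

		// Load a perluniprops character class; "IsAlnum" is an illustrative name,
		// valid values correspond to the files in the perluniprop archive.
		alnum := moses.PerlPropsLoader("IsAlnum")
		fmt.Println(len(alnum))
	}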
func RemoveEmptyStringFromSlice ¶
RemoveEmptyStringFromSlice removes any empty strings from the given slice
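A short sketch of the slice helper, using the same placeholder import as the earlier sketches.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		cleaned := moses.RemoveEmptyStringFromSlice([]string{"foo", "", "bar", ""})
		fmt.Println(cleaned) // expected: [foo bar]
	}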
Types ¶
type Normalizer ¶
type Normalizer struct {
// contains filtered or unexported fields
}
Normalizer is a Go port of the Moses punctuation normalizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl. The design largely follows the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py
func NewNormalizer ¶
func NewNormalizer(lang string, penn bool, normQuoteCommas bool, normNumbers bool, preReplaceUniPunct bool, postRemoveCtrlChars bool) *Normalizer
NewNormalizer creates a new instance of the normalizer. Several parameters are provided to disable specific normalization rules, such as quote normalization, number normalization and unicode normalization
func (Normalizer) Normalize ¶
func (n Normalizer) Normalize(text string) (normalizedText string)
Normalize the incoming text according to pre-defined rules
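An end-to-end sketch of the Normalizer API above, using the NewNormalizer signature shown; the import path remains a placeholder, and the exact output depends on the normalization rule set.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// lang="en", penn=true, normQuoteCommas=true, normNumbers=true,
		// preReplaceUniPunct=true, postRemoveCtrlChars=true
		n := moses.NewNormalizer("en", true, true, true, true, true)

		normalized := n.Normalize("«Hello»  –  it's a test…")
		fmt.Println(normalized) // punctuation, quotes and spacing normalized per the Moses rules
	}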
type Replacement ¶
type Replacement struct {
// contains filtered or unexported fields
}
Replacement is a tuple consisting of a regex and its substitution
func Flatten ¶
func Flatten(r [][]Replacement) []Replacement
Flatten reduces the dimensionality of its argument from a 2D slice of Replacements to a 1D slice
func NewReplacement ¶
func NewReplacement(rgx string, sub string) (replacement Replacement)
NewReplacement creates a Replacement from a regex pattern and its substitution
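A small sketch of building Replacement values and flattening grouped rule sets; the patterns are illustrative and the import path is again a placeholder.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// Each Replacement pairs a regex pattern with its substitution.
		collapseSpaces := moses.NewReplacement(`\s+`, " ")
		asciiDash := moses.NewReplacement(`–`, "-")

		// Rules are often grouped by purpose; Flatten reduces the 2D grouping
		// to a single flat slice of Replacements.
		flat := moses.Flatten([][]moses.Replacement{{collapseSpaces}, {asciiDash}})
		fmt.Println(len(flat)) // 2
	}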
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer is an instance used to tokenize text. It is a Go port of the Moses tokenizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl. The design largely follows the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py
func NewTokenizer ¶
NewTokenizer creates a new Tokenizer instance for the given language
func (Tokenizer) Tokenize ¶
func (t Tokenizer) Tokenize(text string, aggresiveDashSplits bool, escapeXML bool) (string, []string)
Tokenize splits the incoming string according to the predefined language option. More aggressive dash splitting (e.g. "foo-bar" to "foo @-@ bar") and escaping of XML tags can optionally be enabled
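A usage sketch of the Tokenizer; NewTokenizer's full signature is not shown above, so the single-argument call below is an assumption, and the import path is the same placeholder as in the earlier sketches.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// Assumption: NewTokenizer takes only the language code; check the real signature.
		t := moses.NewTokenizer("en")

		// aggresiveDashSplits=true, escapeXML=true
		joined, tokens := t.Tokenize("This is a foo-bar sentence.", true, true)
		fmt.Println(joined) // e.g. "This is a foo @-@ bar sentence ."
		fmt.Println(tokens)
	}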