Documentation ¶
Index ¶
- func IsAnyAlphabet(text string) bool
- func IsInArray(text string, arr []string) bool
- func IsLower(text string) bool
- func IsNumber(text string) bool
- func NonBreakingPrefixesLoader(lang string) (result []string)
- func PerlPropsLoader(ext string) string
- func RemoveEmptyStringFromSlice(texts []string) (result []string)
- type Normalizer
- type Replacement
- type Tokenizer
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func IsAnyAlphabet ¶
IsAnyAlphabet checks whether at least one alphabet character exists in the string
func IsLower ¶
IsLower checks whether a string consists entirely of lowercase characters
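A minimal usage sketch of the two checks above; the package alias "moses" and its import path are placeholders for this package's real import path.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path, replace with the real one
	)

	func main() {
		fmt.Println(moses.IsAnyAlphabet("123a456")) // true: contains at least one alphabet character
		fmt.Println(moses.IsAnyAlphabet("12345"))   // false: digits only
		fmt.Println(moses.IsLower("hello"))         // true: all characters are lowercase
		fmt.Println(moses.IsLower("Hello"))         // false: contains an uppercase character
	}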
func NonBreakingPrefixesLoader ¶
NonBreakingPrefixesLoader reads the list of non-breaking prefixes (e.g. abbreviations followed by a period) for the given language, following the nonbreaking_prefixes lists shipped with Moses (https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes)
func PerlPropsLoader ¶
PerlPropsLoader reads lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip archive are extracted using the Unicode::Tussle module from http://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm
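A hedged sketch of the two loaders above. The argument values ("en", "IsAlnum") are illustrative guesses based on the Moses and perluniprops conventions, not confirmed by these docs, and the import path is a placeholder.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// Load the non-breaking prefix list; "en" is an assumed language code.
		prefixes := moses.NonBreakingPrefixesLoader("en")
		fmt.Println(len(prefixes))

		// Load a perluniprops character class; "IsAlnum" is an illustrative name,
		// valid values correspond to the files in the perluniprop archive.
		alnum := moses.PerlPropsLoader("IsAlnum")
		fmt.Println(len(alnum))
	}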
func RemoveEmptyStringFromSlice ¶
RemoveEmptyStringFromSlice removes any empty strings from the given slice
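A short sketch of the slice helper, using the same placeholder import as the earlier sketches.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		cleaned := moses.RemoveEmptyStringFromSlice([]string{"foo", "", "bar", ""})
		fmt.Println(cleaned) // expected: [foo bar]
	}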
Types ¶
type Normalizer ¶
type Normalizer struct {
// contains filtered or unexported fields
}
Normalizer is a Go port of the Moses punctuation normalizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl. The design largely follows the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py
func NewNormalizer ¶
func NewNormalizer(lang string, penn bool, normQuoteCommas bool, normNumbers bool, preReplaceUniPunct bool, postRemoveCtrlChars bool) *Normalizer
NewNormalizer creates a new instance of the normalizer. Several parameters are provided to disable specific normalization rules, such as quote normalization, number normalization and unicode normalization
func (Normalizer) Normalize ¶
func (n Normalizer) Normalize(text string) (normalizedText string)
Normalize the incoming text according to pre-defined rules
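An end-to-end sketch of the Normalizer API above, using the NewNormalizer signature shown; the import path remains a placeholder, and the exact output depends on the normalization rule set.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// lang="en", penn=true, normQuoteCommas=true, normNumbers=true,
		// preReplaceUniPunct=true, postRemoveCtrlChars=true
		n := moses.NewNormalizer("en", true, true, true, true, true)

		normalized := n.Normalize("«Hello»  –  it's a test…")
		fmt.Println(normalized) // punctuation, quotes and spacing normalized per the Moses rules
	}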
type Replacement ¶
type Replacement struct {
// contains filtered or unexported fields
}
Replacement is a tuple consisting of a regex and its substitution
func Flatten ¶
func Flatten(r [][]Replacement) []Replacement
Flatten reduces the dimensionality of its argument from a 2D slice of Replacements to a 1D slice
func NewReplacement ¶
func NewReplacement(rgx string, sub string) (replacement Replacement)
NewReplacement creates a Replacement from a regex pattern and its substitution
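A small sketch of building Replacement values and flattening grouped rule sets; the patterns are illustrative and the import path is again a placeholder.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// Each Replacement pairs a regex pattern with its substitution.
		collapseSpaces := moses.NewReplacement(`\s+`, " ")
		asciiDash := moses.NewReplacement(`–`, "-")

		// Rules are often grouped by purpose; Flatten reduces the 2D grouping
		// to a single flat slice of Replacements.
		flat := moses.Flatten([][]moses.Replacement{{collapseSpaces}, {asciiDash}})
		fmt.Println(len(flat)) // 2
	}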
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer is an instance used to tokenize text. It is a Go port of the Moses tokenizer from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl. The design largely follows the Python version at https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py
func NewTokenizer ¶
NewTokenizer creates a new Tokenizer instance for the given language
func (Tokenizer) Tokenize ¶
func (t Tokenizer) Tokenize(text string, aggresiveDashSplits bool, escapeXML bool) (string, []string)
Tokenize splits the incoming string according to the predefined language option. More aggressive dash splitting (e.g. "foo-bar" to "foo @-@ bar") and escaping of XML tags can optionally be enabled
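A usage sketch of the Tokenizer; NewTokenizer's full signature is not shown above, so the single-argument call below is an assumption, and the import path is the same placeholder as in the earlier sketches.
	package main

	import (
		"fmt"

		moses "github.com/example/moses" // hypothetical import path
	)

	func main() {
		// Assumption: NewTokenizer takes only the language code; check the real signature.
		t := moses.NewTokenizer("en")

		// aggresiveDashSplits=true, escapeXML=true
		joined, tokens := t.Tokenize("This is a foo-bar sentence.", true, true)
		fmt.Println(joined) // e.g. "This is a foo @-@ bar sentence ."
		fmt.Println(tokens)
	}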