In general, the least-frequent words in your training data are also the safest to discard. This is especially the case for 'word2vec' and similar algorithms: there may not be enough varied examples of each rare word's usage to learn a reliable representation, as opposed to a weak, idiosyncratic representation based on the few not-necessarily-representative examples you do have.

Also, rare words won't recur as often in future texts, making their relative value in the model smaller.

And, by the typical 'zipfian' distribution of word frequencies in natural-language material, while each individual rare word only occurs a few times, altogether there are many such words. So just discarding words with one to a few occurrences will often shrink the vocabulary (and thus the overall model) by half or more.

Finally, it's been observed with 'word2vec' that discarding those intervening rare words – which are many in total number, though each individually has only limited-quality examples – often improves the quality of the surviving, more-frequent word-vectors. Those more-important words then have fewer intervening low-value 'noisy' words moving them out of each other's context windows, or pulling the model's weights in other directions via interleaved training examples.

(Similarly, in adequate corpuses, more-aggressive frequent-word downsampling, as controlled by the sample parameter, can often increase word-vector quality while also speeding training – though with no savings in overall vocabulary size, since no words are totally eliminated by that setting.)

On the other hand, 'stop words' are insufficiently numerous to offer much vocabulary-size savings when discarded. Discard them, or not, based on whether their presence helps or hurts your later steps and final results – not to save a tiny amount of vocabulary-driven model space.

Note that for gensim's Word2Vec model, and related algorithms, in addition to the min_count parameter – which discards all words appearing fewer times than that value – there is also the max_final_vocab parameter, which will dynamically choose whatever min_count is sufficient to achieve a final vocabulary size no larger than the max_final_vocab value. So if you know you have the system memory to support a 1-million-word model, you don't have to use trial-and-error on min_count values to get near that: you can just specify max_final_vocab=1000000, min_count=1. (On the other hand, be careful with the max_vocab_size parameter.)
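To make the min_count / max_final_vocab trade-off concrete, here is a minimal sketch using gensim's Word2Vec class (gensim 4.x API). The toy corpus and the specific values (sample=1e-4, a cap of 1,000,000 words) are placeholders for illustration, not recommendations.

```python
from gensim.models import Word2Vec

# Toy stand-in corpus; in practice `sentences` would be your full iterable of token lists.
sentences = [["graph", "minors", "survey"], ["graph", "trees", "survey"],
             ["human", "interface", "computer"], ["graph", "minors", "trees"]] * 50

# Discard the rarest words explicitly: raising min_count is the usual way to shrink
# the vocabulary (and thus the model), per the Zipfian argument above. The `sample`
# parameter controls frequent-word downsampling but removes nothing from the vocabulary.
model_a = Word2Vec(sentences, min_count=5, sample=1e-4, workers=2)

# Or let gensim pick whatever min_count keeps the final vocabulary at or below the cap.
model_b = Word2Vec(sentences, min_count=1, max_final_vocab=1_000_000, workers=2)

print(len(model_a.wv), len(model_b.wv))  # surviving vocabulary sizes
```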
There are indeed a few recent developments that try to counteract this problem. The most notable ones are probably subword units (also known as Byte Pair Encodings, or BPEs), which you can imagine as a notion similar to syllables in a word (but not the same!). A word like "basketball" could then be broken into pieces such as bas + ket + ball or basket + ball – note that this is a constructed example and might not reflect the actually chosen subwords.

The idea itself is relatively old (an article from 1994), but was recently popularized by Sennrich et al., and is used in basically every state-of-the-art NLP library that has to deal with large vocabularies. The two biggest implementations of this idea are probably fastBPE and Google's SentencePiece.

With subword units, you now basically have the freedom to choose a fixed vocabulary size, and the algorithm will then try to optimize towards a mix of word diversity and splitting "more complex words" into several pieces, such that your desired vocabulary size can cover any word in the corpus. For the exact algorithm, though, I highly recommend you look into the linked paper or implementations.
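As a rough illustration of the fixed-vocabulary idea, here is a minimal sketch using the sentencepiece Python package. The file name corpus.txt, the model_prefix, and the vocab_size of 8000 are hypothetical placeholders; the pieces actually produced depend entirely on your corpus.

```python
import sentencepiece as spm

# Train a subword model with a fixed vocabulary budget.
# "corpus.txt" is a hypothetical plain-text file with one sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="subword",   # writes subword.model / subword.vocab
    vocab_size=8000,          # the fixed size the algorithm optimizes towards
    model_type="bpe",         # byte-pair-encoding merges; "unigram" is the default
)

# Any word in the corpus can now be covered by a sequence of learned pieces.
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("basketball", out_type=str))  # e.g. ['▁basket', 'ball'], depending on the corpus
```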