Difference between revisions of "Term Frequency, Inverse Document Frequency (TF-IDF)"
[http://www.youtube.com/results?search_query=tf+idf+Term+Frequency+Inverse+Document+nlp+nli+natural+language YouTube search...]
Revision as of 07:09, 5 January 2019
- Natural Language Processing (NLP)
- Scikit-learn: machine learning in Python; simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib
- Bag-of-Words (scikit-learn: Count Vectorizer)
- Word2Vec
- Doc2Vec
- Skip-Gram
- Global Vectors for Word Representation (GloVe)
TF-IDF is a statistic that estimates how important a word is to each document in a collection. Term frequency serves as a proxy for importance within a document: if "football" is mentioned 25 times in a document, it is probably more important to that document than if it were mentioned only once. Document frequency (the number of documents containing a given word) measures how common the word is across the collection, and a word's weight is reduced as its document frequency rises. This down-weights stop-words such as pronouns, as well as domain-specific words that add little information (for example, "news", which might appear in most documents).
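The weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: it assumes the common textbook form tf × log(N / df), whitespace tokenization, and a made-up three-document corpus. The function name `tf_idf` and the sample documents are hypothetical. Libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: weight = (count / doc length) * log(N / df),
    the unsmoothed textbook form of TF-IDF.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical corpus: "news" appears in every document.
docs = [
    "football news football match",
    "election news vote",
    "weather news rain",
]
scores = tf_idf(docs)
```

In this toy corpus, "news" occurs in all three documents, so its inverse document frequency is log(3/3) = 0 and its weight vanishes, while "football", which is frequent in the first document and absent from the others, receives the highest weight there.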