Term Frequency, Inverse Document Frequency (TF-IDF)

* [[Natural Language Processing (NLP)]]
* [[Scikit-learn]] Machine Learning in Python, Simple and efficient tools for data mining and data analysis; Built on NumPy, SciPy, and matplotlib
* [[Bag-of-Words]]
* [[Word2Vec]]
* [[Doc2Vec]]

Revision as of 13:50, 20 April 2019


This statistic represents a word's importance in each document. A word's frequency within a document (the term frequency) serves as a proxy for its importance: if "football" is mentioned 25 times in a document, it is likely more important there than if it were mentioned only once. The document frequency (the number of documents containing a given word) measures how common the word is across the corpus; weighting each term by the inverse of its document frequency down-weights words that appear almost everywhere. This minimizes the effect of stop words such as pronouns, and of domain-specific language that adds little information (for example, a word such as "news" that might be present in most documents).