Term Frequency, Inverse Document Frequency (TF-IDF)
- Natural Language Processing (NLP)
- Scikit-learn: machine learning in Python; simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib
- Bag-of-Words
- Word2Vec
- Doc2Vec
- Skip-Gram
- Global Vectors for Word Representation (GloVe)
TF-IDF is a statistic that represents a word's importance in each document of a corpus. We use a word's frequency within a document as a proxy for its importance: if "football" is mentioned 25 times in a document, it is likely more important there than if it were mentioned only once. We also use the inverse of the document frequency (the number of documents containing a given word) to measure how common the word is across the corpus, so words that appear in almost every document receive lower weights. This minimizes the effect of stop-words such as pronouns, as well as domain-specific language that adds little information (for example, a word such as "news" that might be present in most documents).
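A common formulation scores a term t in document d as tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the term's count in d, N is the number of documents, and df(t) is the number of documents containing t. As a minimal sketch of this idea, the snippet below uses scikit-learn's TfidfVectorizer on a small made-up corpus; the example sentences and variable names are illustrative assumptions, not taken from this article:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny illustrative corpus: "football" appears often in the first
# document, while "news" appears in every document.
corpus = [
    "football news: the football match ended in a draw",
    "stock market news for investors",
    "weather news and local updates",
]

# TfidfVectorizer computes term frequencies and rescales them by
# inverse document frequency, downweighting corpus-wide words.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # shape: (3 docs, vocab size)

terms = vectorizer.get_feature_names_out()
weights = tfidf_matrix.toarray()[0]  # TF-IDF weights for the first document

# "football" (frequent here, rare elsewhere) outranks "news"
# (present in every document), matching the intuition above.
for term, weight in sorted(zip(terms, weights), key=lambda p: -p[1])[:5]:
    print(f"{term}: {weight:.3f}")
```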
Cosine Similarity
TF-IDF is a transformation applied to texts to produce real-valued vectors in a vector space. Given two such vectors, we can compute their cosine similarity by taking their dot product and dividing by the product of their norms: cos(θ) = (d1 · d2) / (||d1|| ||d2||), the cosine of the angle between the vectors. Cosine similarity is a measure of similarity between two non-zero vectors, so this formula lets us score the similarity of any two documents d1 and d2. Source: Building a Simple Chatbot from Scratch in Python (using NLTK) | Parul Pandey, Analytics Vidhya
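As a small sketch of this step (the two example sentences are illustrative assumptions, not from this article), scikit-learn's cosine_similarity can score a pair of TF-IDF vectors directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two illustrative documents, d1 and d2.
documents = [
    "the football match ended in a draw",        # d1
    "a thrilling football game finished level",  # d2
]

# Transform both texts into TF-IDF vectors in the same vector space.
vectors = TfidfVectorizer().fit_transform(documents)

# cosine(d1, d2) = (d1 . d2) / (||d1|| * ||d2||); 1.0 means the vectors
# point in the same direction, 0.0 means the documents share no terms.
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity between d1 and d2: {score:.3f}")
```

Computing the dot product over normalized vectors means document length drops out, so a long and a short document about the same topic can still score as similar.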