The task of the TF-IDF exercise is to compute the term-frequency/inverted-document-frequency (TF-IDF) metric for words in mails of the Flink developer mailing list archives.
TF-IDF is a metric to measure the importance of a term (word) in a document that is part of a collection of documents. It is commonly used to weight the importance of search terms in search engines.
TF-IDF is defined in two parts:
- The term-frequency tf(t, d) is the frequency of a term t in a document d. For example, tf('house', d1) is equal to three if document d1 contains the word 'house' three times.
- The inverted-document-frequency idf(t, D) is the inverted fraction of documents that contain a term t in a collection of documents D. For example, idf('house', D) = 1.25 if four documents in a collection of five documents contain the word 'house' (5 / 4 = 1.25).
The TF-IDF metric for a term t in document d of a collection D is computed as tf-idf(t, d, D) = tf(t, d) * idf(t, D).
Following this definition, a term has a high TF-IDF score (and is considered to be representative for a document) if it occurs very often in the document, but is rare in the overall collection of documents.
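The formula above can be sketched as a small helper method. This is an illustrative, hypothetical snippet (class and method names are not part of the exercise); it only shows the arithmetic, using the 'house' example from the definition:

```java
public class TfIdfFormula {
    // tf-idf(t, d, D) = tf(t, d) * idf(t, D), where idf(t, D) is the inverted
    // fraction of documents containing the term: numDocs / numDocsWithTerm.
    public static double tfIdf(int tf, long numDocs, long numDocsWithTerm) {
        double idf = (double) numDocs / numDocsWithTerm;
        return tf * idf;
    }

    public static void main(String[] args) {
        // The 'house' example: tf = 3, idf = 5 / 4 = 1.25
        System.out.println(tfIdf(3, 5, 4)); // prints 3.75
    }
}
```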
This exercise uses the Mail Data Set, which was extracted from the Apache Flink development mailing list archive. The Mail Data Set instructions show how to read the data set in a Flink program.
The task requires two fields, MessageId and Body. The input data can be read as a DataSet<Tuple2<String, String>>. When printed, the data set should look similar to this:
(<CAGr9p8A8Z7P787=c5RF5QbPKudLmPUsV3jCHKefZbwm=0UF-GA@mail.gmail.com>, --047d7ba975c83720ed05122e218e Content-Type: text/plain; charset=UTF-8 I didn't know that there was already an issue for this. I closed FLINK-1787. The correct issue is this one: https://issues.apache.org/jira/browse/FLINK-1711 --047d7ba975c83720ed05122e218e-- )
(<CANC1h_tGSdNdjO0YqQDPhBBtordrsMEOL8z1Rxwgy12LYB8aoQ@mail.gmail.com>, --089e011823584ee26304fc0927d5 Content-Type: text/plain; charset=UTF-8 Yep, this looks like a bug, I agree. --089e011823584ee26304fc0927d5-- )
The result of the program should be a DataSet<Tuple3<String, String, Double>> where the first field is the MessageId, the second field is the Word, and the third field is the TF-IDF score. When printed, the data set should look like this:
(<CAPud8TrjpXYyqo6ur05-sjhUwRSh10XJVnCx_RuW9aUNKqjUTw@mail.gmail.com>,slide,2901.0)
(<CAPud8TrjpXYyqo6ur05-sjhUwRSh10XJVnCx_RuW9aUNKqjUTw@mail.gmail.com>,might,7.7774)
(<CAPud8TrjpXYyqo6ur05-sjhUwRSh10XJVnCx_RuW9aUNKqjUTw@mail.gmail.com>,case,21.1751)
(<CANC1h_s+BoRd7YMs5JTyhouDSC4x+5whH5feyfZOhG=_N7ZUsA@mail.gmail.com>,ship,68.2588)
(<CANC1h_s+BoRd7YMs5JTyhouDSC4x+5whH5feyfZOhG=_N7ZUsA@mail.gmail.com>,api,34.0721)
The first line of the example output indicates that the word slide has a TF-IDF score of 2901.0 in the mail with the ID <CAPud8TrjpXYyqo6ur05-sjhUwRSh10XJVnCx_RuW9aUNKqjUTw@mail.gmail.com>.
A simple way to extract words is splitting the Body String by whitespaces, filtering for tokens with alphabetical characters only, and converting all letters to lowercase. It is also common to filter out words that frequently occur in texts (so-called stop words) such as: "the", "i", "a", "an", "at", "are", "am", "for", "and", "or", "is", "there", "it", "this", "that", "on", "was", "by", "of", "to", "in", "not", "be", "with", "you", "have", "as", "can"
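The tokenization steps above (whitespace split, alphabetical-only filter, lowercasing, stop-word removal) could be sketched in plain Java like this. The class and method names are illustrative, not part of the exercise, and the stop-word set is the list given above:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class Tokenizer {
    // The stop words listed above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "the", "i", "a", "an", "at", "are", "am", "for", "and", "or", "is",
        "there", "it", "this", "that", "on", "was", "by", "of", "to", "in",
        "not", "be", "with", "you", "have", "as", "can"));

    public static List<String> tokenize(String body) {
        return Arrays.stream(body.split("\\s+"))   // split by whitespaces
            .map(String::toLowerCase)              // convert to lowercase
            .filter(t -> t.matches("[a-z]+"))      // alphabetical characters only
            .filter(t -> !STOP_WORDS.contains(t))  // drop stop words
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Yep, this looks like a bug, I agree."));
        // prints [looks, like]
    }
}
```

Note that the alphabetical-only filter also discards tokens with trailing punctuation (e.g., "bug,"); a production tokenizer would strip punctuation first, but this matches the simple approach described above.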
The number of documents in the collection can be obtained with the DataSet.count() method, which returns a long value with the number of elements in the data set. Note that count() is eagerly executed and does not require an explicit ExecutionEnvironment.execute() call. The long value can be shared with the function that computes the IDF as a constructor parameter.
The term frequencies can be computed with a FlatMapFunction. The function receives one document at a time, extracts all words, counts the occurrences of each unique word in the text (e.g., using a hash map), and emits one tuple with three fields (UniqueMID, Word, TF) for each counted word.
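A dependency-free sketch of the per-document logic such a function could implement, assuming the words have already been extracted (Object[] stands in for Flink's Tuple3 so the snippet runs without Flink on the classpath; the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFrequency {
    // Emits one (UniqueMID, Word, TF) triple per unique word of a document,
    // counting occurrences with a hash map as described above.
    public static List<Object[]> termFrequencies(String messageId, List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);  // count occurrences per word
        }
        List<Object[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(new Object[]{messageId, e.getKey(), e.getValue()});
        }
        return out;
    }
}
```

In the actual Flink program, the same logic would live in the flatMap() method and emit Tuple3<String, String, Integer> records through the Collector.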
Finally, join the term frequencies with the document frequencies on the Word field, and compute the corresponding TF-IDF score.
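The join on the Word field can be illustrated in plain Java as a lookup of each TF triple's word in an IDF map (again using Object[] in place of Flink's tuples; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TfIdfJoin {
    // Joins (UniqueMID, Word, TF) triples with an IDF lookup on the word,
    // producing (UniqueMID, Word, TF * IDF) result triples.
    public static List<Object[]> join(List<Object[]> tfTriples, Map<String, Double> idf) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] t : tfTriples) {
            String word = (String) t[1];
            int tf = (int) t[2];
            out.add(new Object[]{t[0], word, tf * idf.get(word)});
        }
        return out;
    }
}
```

In the Flink program, this step would be a join of the two data sets with where(...).equalTo(...) on the Word field, followed by a function that multiplies the two values.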
Reference solutions are available on GitHub.