The task of the “Member of the Month” exercise is to find for each month the email address that sent the most emails to the Apache Flink development mailing list.
This exercise uses the Mail Data Set which was extracted from the Apache Flink development mailing list archive. The Mail Data Set instructions show how to read the data set in a Flink program using the
The task requires two fields, Timestamp and Sender. The input data can be read as a DataSet<Tuple2<String, String>>. When printed, the data set should look similar to this:
(2014-09-26-08:49:58,Fabian Hueske <email@example.com>)
(2014-09-12-14:50:38,Aljoscha Krettek <firstname.lastname@example.org>)
(2014-09-30-09:16:29,Stephan Ewen <email@example.com>)
The result of the task should be a DataSet<Tuple3<String, String, Integer>>. The first field specifies the month, and the second field specifies the email address that sent the most emails to the mailing list in that month. When printed, the data set should look like this:
(2014-07,firstname.lastname@example.org)
(2014-06,email@example.com)
(2015-05,firstname.lastname@example.org)
(2015-02,email@example.com)
(2014-08,firstname.lastname@example.org)
The first line of the example result indicates that in July 2014, email@example.com sent the most emails to the Flink developer mailing list and is therefore the member of that month.
Once the data has been brought into a structured format, the remaining analysis can be done with the Table API in three steps: first, compute the number of emails per month and email address; second, compute the maximum number of emails sent from a single address in each month; and finally, find the email address whose count equals that monthly maximum.
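The three steps can be illustrated with a minimal sketch in plain Java collections. It is not Flink Table API code and not the reference solution; it only mirrors the logic (count, per-month maximum, join on the maximum). The class name MemberOfMonthSketch, the method membersOfMonth, and the String[] {month, address} input shape are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; plain Java, not the Flink Table API.
final class MemberOfMonthSketch {

    // Each input element is assumed to be a {month, address} pair,
    // e.g. {"2014-09", "someone@example.org"}.
    static Map<String, List<String>> membersOfMonth(List<String[]> monthAddressPairs) {
        // Step 1: count the emails per (month, address).
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String[] pair : monthAddressPairs) {
            counts.computeIfAbsent(pair[0], m -> new TreeMap<>())
                  .merge(pair[1], 1, Integer::sum);
        }

        // Step 2: compute the maximum count of a single address per month.
        Map<String, Integer> maxPerMonth = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> month : counts.entrySet()) {
            int max = 0;
            for (int count : month.getValue().values()) {
                max = Math.max(max, count);
            }
            maxPerMonth.put(month.getKey(), max);
        }

        // Step 3: keep the addresses whose count equals the monthly maximum.
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> month : counts.entrySet()) {
            int max = maxPerMonth.get(month.getKey());
            List<String> winners = new ArrayList<>();
            for (Map.Entry<String, Integer> c : month.getValue().entrySet()) {
                if (c.getValue() == max) {
                    winners.add(c.getKey());
                }
            }
            result.put(month.getKey(), winners);
        }
        return result;
    }
}
```

In this sketch a tie keeps every address that reaches the monthly maximum, which is what joining the counts against the maxima would naturally produce. How ties should be handled in the exercise itself is not specified here.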
A Map transformation processes one record at a time and should be used to extract the relevant information from the input data, i.e., the month from the timestamp field and the email address from the sender field.
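The per-record extraction could look roughly like the following plain-Java sketch, which contains the kind of logic that might go inside a map function. It assumes the timestamp always has the form shown in the sample data (yyyy-MM-dd-HH:mm:ss) and that the address is enclosed in angle brackets. The class name MailFieldExtractor and its method names are hypothetical and not part of any Flink API.

```java
// Hypothetical helper class; the logic is what a map function might apply per record.
final class MailFieldExtractor {

    // Assumes the sample format "2014-09-26-08:49:58"; the first 7 characters are "yyyy-MM".
    static String extractMonth(String timestamp) {
        return timestamp.substring(0, 7);
    }

    // Assumes "Name <address>"; falls back to the trimmed input if no brackets are found.
    static String extractEmail(String sender) {
        int start = sender.lastIndexOf('<');
        int end = sender.lastIndexOf('>');
        if (start >= 0 && end > start) {
            return sender.substring(start + 1, end);
        }
        return sender.trim();
    }

    public static void main(String[] args) {
        String month = extractMonth("2014-09-26-08:49:58");
        String email = extractEmail("Fabian Hueske <email@example.com>");
        System.out.println("(" + month + "," + email + ")"); // prints (2014-09,email@example.com)
    }
}
```

Keeping the extraction in small, side-effect-free methods makes it easy to call them from a map function and to test them without starting a Flink job.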
Reference solutions are available at GitHub: