
 /* This is a Pig template file to get tf-idf calculated for all the NSF grants. Before starting off,
 * 1) Unzip the zip file in your local directory
 * 2) Use bin/hadoop fs -cp 
 * /
  
 /* We have to load all the files in hdfs grants directory. This can be done using the command below. *The nice thing about the PigStorage class is that it helps us get the file name along with the *sentence in the file. At the end of this command, you will have (file_name, sentence) tuples to go *further. Please take care to change the hdfs load directory if necessary*/
  
  
 file_and_sentence = load ‘grants/*' using PigStorage('\t', '-tagsource') as (file_name: chararray, sentence: chararray);
  
  
 /*We now split each sentence using the TOKENIZE Pig function and flatten out the tokens we get from the split. At the end of this step, we get (filename, word1, word2,….). The tuple of words are broken down from the sentence.
 */
  
 file_and_words = foreach file_and_sentence generate file_name as file_name,flatten(TOKENIZE(sentence)) as words;
  
  
 filtered_file_and_words = filter file_and_words by (words matches '\\w+');
  
  
 /*We now group each file and the words of the sentence that it follows. A group operation in Pig gives us (group, {members of the group}). A flatten on the group will produce more tuples. Please refer to this document for the details. We are doing the following step for us to get neat tuples of the form (filename, word, count of word). http://pig.apache.org/docs/r0.9.1/basic.html#flatten
 */ 
  
 lowercased_file_and_word = foreach filtered_file_and_words generate file_name as file_name, LOWER(words) as word;
 
 
