Next up, I need to figure out how to feed it html pages. I could just feed it straight up html files, but I ultimately want to word count blogs and such. On to the next!

word_count = {}
File.open('test_input.txt', 'r') do |f1|
  while line = f1.gets
    words = line.split(" ")
    words.each do |word|
      word = word.downcase
      if !word_count.has_key?(word) # word not in hash?
        word_count[word] = 1 # add it and count it
      else
        word_count[word] += 1 # only increment count
      end # if it's there
    end
  end
end
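For what it's worth, the same counting could probably be written more compactly by giving the hash a default value of 0, which makes the has_key? check unnecessary. This is only a sketch: the method name count_words and the sample lines are made up for illustration and are not part of the script above.

```ruby
# Hypothetical helper: counts words across any collection of lines.
def count_words(lines)
  counts = Hash.new(0) # missing keys start at 0 instead of nil
  lines.each do |line|
    line.split(" ").each do |word|
      counts[word.downcase] += 1 # no existence check needed
    end
  end
  counts
end

counts = count_words(["The cat", "the hat"])
puts counts["the"] # prints 2
```

Because File.open hands back an object that responds to each, passing the open file in place of the array should work the same way.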
Sometime later, we rejoin our hero:
So I added a line to allow me to alternatively feed the wc any file I like via the command line:
File.open( ARGV[0] ? ARGV[0] : 'default.txt', 'r' ) do

I'm using ARGV[0] ? to check whether any arguments were sent along with the request to run the script. ARGV[0] is nil when nothing was passed, and nil counts as false, so if there is an argument we use it as the file name. If not, we use my creatively-named 'default.txt' so that we don't blow a gasket and throw an error for not having anything to work with. I snagged some HTML from a random website and fed it to the wc program.
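Here is a small sketch of that fallback pulled out on its own, using || instead of the ternary. The helper name input_path is hypothetical, invented just for this example; the behavior should match the line above.

```ruby
# Hypothetical helper: picks the input file from a list of arguments.
def input_path(args, fallback = 'default.txt')
  # args[0] is nil when no argument was given, and nil is falsy,
  # so || falls through to the fallback file name.
  args[0] || fallback
end

path = input_path(ARGV)
puts "Reading from #{path}"
```

Running the script with no arguments should report default.txt, and running it with a file name should report that name instead.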
Next up: parsing out the tags so that all we're left with is the actual content of the site. After that, I need to figure out how to get the generated HTML in the first place. I've heard of screen scrapers (and usually not in a positive way) but I think that's what I need to build here. Ultimately, I would like to give this little program the URLs for two different websites and have it compare the two. I'm a long way from there, but it's nice to have a goal. :)
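As a rough first stab at the tag-stripping step, something like the following might do. It is a crude sketch, not a real parser: the helper name strip_tags is made up, and it assumes reasonably tidy markup. It will not cope with script bodies, HTML comments, or a ">" inside an attribute value, so a proper HTML parsing library would likely be the sturdier route.

```ruby
# Hypothetical helper: crude tag removal with a regular expression.
def strip_tags(html)
  html.gsub(/<[^>]*>/, " ") # swap each <...> tag for a space
      .squeeze(" ")         # collapse runs of spaces into one
      .strip                # trim leading and trailing whitespace
end

puts strip_tags("<p>Hello <b>big</b> world</p>") # prints Hello big world
```

The cleaned-up string could then be split into words and counted the same way as any other line of input.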