Project Link: https://github.com/haard7/IR-Project-A20540508

Description:

  • This project consists of three components that together demonstrate end-to-end search engine functionality at a small scale, to show how web search engines actually work.

1) Crawler

  • It is a Scrapy-based crawler that downloads HTML documents starting from the given URLs, bounded by the max_depth and max_pages parameters.

  • In this project I have used Wikipedia URLs, mostly related to renewable energy and power. The crawler downloads the HTML documents into the crawler/Data directory; all of these documents are used by the next component to build the inverted index. Each document is saved under the last path segment of its URL to keep track of the documents. A sketch of such a spider is shown below.
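
A minimal sketch of such a spider, assuming hypothetical names (WikiSpider, a single seed URL) and example limits; Scrapy's built-in DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT settings stand in for the project's max_depth and max_pages parameters:

```python
import os
from urllib.parse import urlparse

import scrapy
from scrapy.crawler import CrawlerProcess


class WikiSpider(scrapy.Spider):
    name = "wiki_spider"
    # Hypothetical seed URL and limits; the real project takes these as parameters.
    start_urls = ["https://en.wikipedia.org/wiki/Renewable_energy"]
    custom_settings = {
        "DEPTH_LIMIT": 2,             # plays the role of max_depth
        "CLOSESPIDER_PAGECOUNT": 50,  # plays the role of max_pages
    }

    def parse(self, response):
        # Save the page under the last path segment of its URL,
        # e.g. .../wiki/Solar_power -> Solar_power.html
        filename = urlparse(response.url).path.rstrip("/").split("/")[-1] + ".html"
        os.makedirs(os.path.join("crawler", "Data"), exist_ok=True)
        with open(os.path.join("crawler", "Data", filename), "wb") as f:
            f.write(response.body)

        # Follow in-article links; Scrapy enforces the depth and page limits.
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/wiki/"):
                yield response.follow(href, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(WikiSpider)
    process.start()
```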

2) Indexer

  • It is a scikit-learn-based indexer that creates the inverted index by parsing the HTML documents downloaded by the crawler.
  • It uses TF-IDF scoring and cosine similarity to build the inverted index and rank documents.
  • This component generates two files: inverted_index.json, which contains the postings for each term, and content.json, which stores each document ID with its corresponding document_name and Content; the latter is also printed by the Flask-based processor to check whether the search results are working well.
  • You can also test the indexer locally by running python indexer.py and modifying config.json to see the list of top-k documents printed on the console. A sketch of the index-building step follows this list.
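
A minimal sketch of the index-building step, assuming BeautifulSoup for HTML parsing and simplified layouts for inverted_index.json and content.json; the project's actual field names and postings format may differ:

```python
import json
import os

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

DATA_DIR = os.path.join("crawler", "Data")  # where the crawler saved the HTML files

doc_names, texts = [], []
for fname in sorted(os.listdir(DATA_DIR)):
    with open(os.path.join(DATA_DIR, fname), encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    doc_names.append(fname)
    texts.append(soup.get_text(separator=" "))

# TF-IDF vectors; cosine similarity against them ranks documents at query time.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)

# Inverted index: term -> postings of [doc_id, tf-idf weight].
terms = vectorizer.get_feature_names_out()
inverted_index = {}
coo = tfidf.tocoo()
for doc_id, term_id, weight in zip(coo.row, coo.col, coo.data):
    inverted_index.setdefault(terms[term_id], []).append([int(doc_id), float(weight)])

with open("inverted_index.json", "w") as f:
    json.dump(inverted_index, f)

# content.json maps each doc_id to its document_name and (truncated) content.
with open("content.json", "w") as f:
    json.dump(
        [{"doc_id": i, "document_name": name, "content": text[:500]}
         for i, (name, text) in enumerate(zip(doc_names, texts))],
        f,
    )
```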

3) Processor

  • It is a Flask-based processor that returns the top-k results after performing query validation/error checking and spelling correction. I have used NLTK for stopword removal and FuzzyWuzzy for spelling correction.
  • The Flask app renders the top-k results for a searched query in the UI and also returns them as JSON, including the document name, document ID, and Content of each result. A sketch of such an app is shown below.
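
A minimal sketch of such a processor, assuming the JSON layouts from the indexer sketch above and a hypothetical /search route; for brevity it ranks documents by summing TF-IDF weights from the postings rather than computing full cosine similarity:

```python
import json

from flask import Flask, jsonify, request
from fuzzywuzzy import process as fuzzy_process
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

app = Flask(__name__)

with open("inverted_index.json") as f:
    inverted_index = json.load(f)
with open("content.json") as f:
    content = json.load(f)

STOPWORDS = set(stopwords.words("english"))
VOCAB = list(inverted_index.keys())


@app.route("/search")
def search():
    query = request.args.get("q", "").strip().lower()
    k = int(request.args.get("k", 5))
    if not query:
        return jsonify({"error": "empty query"}), 400  # basic query validation

    scores = {}
    for term in query.split():
        if term in STOPWORDS:
            continue  # NLTK stopword removal
        if term not in inverted_index:
            # FuzzyWuzzy spelling correction: snap unknown terms to the
            # closest term in the index vocabulary.
            term, _score = fuzzy_process.extractOne(term, VOCAB)
        for doc_id, weight in inverted_index[term]:
            scores[doc_id] = scores.get(doc_id, 0.0) + weight

    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return jsonify([content[doc_id] for doc_id in top_k])


if __name__ == "__main__":
    app.run(debug=True)
```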