This project is a Python command-line application for scraping a wiki of your choice (in this case, the Stardew Valley Wiki). It was written as the basis for passing a Python course at MIMUW.
This project uses uv for managing dependencies.
To build the project, use uv build in the project root directory.
To run the project, use uv run wikiscraper in the project root directory.
To see the available options, use uv run wikiscraper --help.
The fallback Google search is optional. To enable it, put a free serper.dev API key in the .env file under the X-API-KEY key.
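As a rough illustration of how the optional key could be detected, here is a minimal standard-library sketch. The helper names read_env_file and fallback_search_enabled are hypothetical and are not taken from the project, which may load the .env file differently (for example through a dedicated package).

```python
def read_env_file(text):
    """Parse KEY=VALUE lines from the contents of a .env file (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def fallback_search_enabled(env_text):
    """Assumption: the fallback is enabled only when X-API-KEY is set and non-empty."""
    return bool(read_env_file(env_text).get("X-API-KEY"))
```

With this sketch, an empty or missing X-API-KEY entry simply leaves the fallback search disabled.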
To run the unit tests, use uv run pytest in the project root directory.
The integration tests can be run using uv run python tests/integration_test.py.
The project is configured through the config.json file.
Available options:
wiki_url - the URL of the wiki to scrape
request_timeout - the timeout for HTTP requests
user_agent - the user agent to use for HTTP requests
accept-language - the language to use for HTTP requests
word_freq_lang - the language to use for word frequency analysis (wordfreq package)
json_path - the path to the JSON file to save word frequencies to
html_path - the path to the directory to read HTML files from (used in stardew_file mode)
mode - the mode to use for scraping (see below)
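The options listed above could be loaded and checked along these lines. This is only a sketch: the values in EXAMPLE_CONFIG are illustrative assumptions, load_config is a hypothetical helper rather than the project's own function, and treating every key as required is also an assumption.

```python
import json
from pathlib import Path

# Keys come from the option list; every value here is an illustrative assumption.
EXAMPLE_CONFIG = {
    "wiki_url": "https://stardewvalleywiki.com",
    "request_timeout": 10,
    "user_agent": "wikiscraper-example/0.1",
    "accept-language": "en",
    "word_freq_lang": "en",
    "json_path": "word_counts.json",
    "html_path": "html_pages",
    "mode": "stardew_normal",
}

# Assumption: all documented keys must be present.
REQUIRED_KEYS = set(EXAMPLE_CONFIG)


def load_config(path):
    """Read a JSON config file and check that every documented key is present."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"missing config keys: {sorted(missing)}")
    return config
```

Failing early on a missing key keeps configuration mistakes from surfacing later as confusing HTTP or file errors.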
Available modes:
stardew_normal - scrapes the wiki normally, through the StardewScraper class
stardew_file - scrapes the wiki using the StardewFileScraper class (a wrapper for StardewScraper), which reads HTML files from a directory instead of fetching them online
New modes can be added by creating a new class that inherits from WikiScraper and implementing its abstract methods.
This allows for custom scraping logic that can be used for different wikis or different scraping strategies.
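The extension pattern described above might look roughly like the sketch below. The WikiScraper class shown here is a stand-in defined only for this example: the abstract method names fetch_html and extract_text, the scrape method, and the InMemoryScraper subclass are all hypothetical, and the project's real base class may declare different methods.

```python
import re
from abc import ABC, abstractmethod


class WikiScraper(ABC):
    """Stand-in for the project's base class; the real abstract methods may differ."""

    @abstractmethod
    def fetch_html(self, title: str) -> str:
        """Return the raw HTML for a page title."""

    @abstractmethod
    def extract_text(self, html: str) -> str:
        """Return the readable text of an HTML page."""

    def scrape(self, title: str) -> str:
        # Shared logic lives in the base class; subclasses only fill in the steps.
        return self.extract_text(self.fetch_html(title))


class InMemoryScraper(WikiScraper):
    """Hypothetical new mode that serves pages from a dictionary."""

    def __init__(self, pages: dict):
        self.pages = pages

    def fetch_html(self, title: str) -> str:
        return self.pages[title]

    def extract_text(self, html: str) -> str:
        # Naive tag stripping, good enough for a sketch.
        return re.sub(r"<[^>]+>", "", html).strip()
```

A new mode defined this way only has to supply the wiki-specific steps, while the shared flow stays in the base class.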