wikiscraper

This project is a Python command-line application for scraping a wiki of your choice (in this case, the Stardew Valley Wiki). It serves as the basis for passing a Python course at MIMUW.

Building and running

Building

This project uses uv for managing dependencies. To build the project, use uv build in the project root directory.

Running

To run the project, use uv run wikiscraper in the project root directory.

Usage

To see the available options, use uv run wikiscraper --help. The fallback Google search is optional; to enable it, put a free serper.dev API key in the .env file under the X-API-KEY key.
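
As a rough illustration of how an optional key in a .env file can switch a feature on, the sketch below reads KEY=VALUE lines by hand. The helper names read_env_value and fallback_search_enabled are hypothetical and not taken from this project, which may load the file differently.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: Path, key: str) -> Optional[str]:
    """Return the value stored under `key` in a KEY=VALUE file, or None."""
    if not env_path.is_file():
        return None
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes; treat an empty value as unset.
            return value.strip().strip("'\"") or None
    return None


def fallback_search_enabled(env_path: Path = Path(".env")) -> bool:
    """Hypothetical check: the fallback search is on only when a key is present."""
    return read_env_value(env_path, "X-API-KEY") is not None
```

With this approach a missing .env file or a missing key simply leaves the fallback search disabled, which matches its optional nature.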

Testing

To run the unit tests, use uv run pytest in the project root directory.

The integration tests can be run using uv run python tests/integration_test.py.

Configuration

The configuration of the project is done through the config.json file. Available options:

wiki_url - the URL of the wiki to scrape
request_timeout - the timeout for HTTP requests
user_agent - the user agent to use for HTTP requests
accept-language - the value of the Accept-Language header sent with HTTP requests
word_freq_lang - the language to use for word frequency analysis (wordfreq package)
json_path - the path to the JSON file to save word frequencies to
html_path - the path to the directory to read HTML files from (used in stardew_file mode)
mode - the mode to use for scraping (see below)
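
The options above could be loaded and checked along these lines. This is a minimal sketch: the option names come from the list above, but every value in EXAMPLE_CONFIG is an illustrative placeholder (including the timeout being in seconds), and treating all options as required is an assumption. The load_config function is hypothetical, not the project's own loader.

```python
import json
from pathlib import Path

# Option names as listed in this README; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "wiki_url", "request_timeout", "user_agent", "accept-language",
    "word_freq_lang", "json_path", "html_path", "mode",
}

# Placeholder values for illustration only.
EXAMPLE_CONFIG = {
    "wiki_url": "https://stardewvalleywiki.com",
    "request_timeout": 10,  # assumed to be seconds
    "user_agent": "wikiscraper-example",
    "accept-language": "en-US,en;q=0.9",
    "word_freq_lang": "en",
    "json_path": "word_counts.json",
    "html_path": "html",
    "mode": "stardew_normal",
}


def load_config(path: Path) -> dict:
    """Read a JSON config file and fail early if an option is missing."""
    with path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config
```

Writing EXAMPLE_CONFIG to a file with json.dump and passing that path to load_config returns the same dictionary; removing any option raises a KeyError naming it.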

Available modes:

stardew_normal - scrapes the wiki normally, through the StardewScraper class
stardew_file - scrapes the wiki using the StardewFileScraper class (a wrapper around StardewScraper), which reads HTML files from a local directory (html_path) instead of fetching them online

New modes can be added by creating a new class that inherits from WikiScraper and implementing its abstract methods. This allows for custom scraping logic that can be used for different wikis or different scraping strategies.
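
The inheritance pattern described above might look roughly like the following. The WikiScraper class here is a stand-in defined only for this sketch: the real base class lives in the project and its abstract methods may have different names and signatures. fetch_html, word_count, and DictScraper are all hypothetical.

```python
import re
from abc import ABC, abstractmethod


class WikiScraper(ABC):
    """Stand-in base class; the project's real abstract methods may differ."""

    @abstractmethod
    def fetch_html(self, title: str) -> str:
        """Return the raw HTML of the article called `title`."""

    def word_count(self, title: str) -> int:
        # Shared logic built on the abstract hook: strip tags naively,
        # then count whitespace-separated words.
        text = re.sub(r"<[^>]+>", " ", self.fetch_html(title))
        return len(text.split())


class DictScraper(WikiScraper):
    """Hypothetical new mode that serves pages from an in-memory mapping."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    def fetch_html(self, title: str) -> str:
        return self._pages[title]


scraper = DictScraper({"Parsnip": "<p>A spring crop</p>"})
print(scraper.word_count("Parsnip"))
```

Because the base class is abstract, instantiating WikiScraper directly (or a subclass that leaves fetch_html unimplemented) raises a TypeError, so an incomplete mode fails at construction time rather than midway through a scrape.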
