[ENH] V1 → V2 API Migration - datasets #1608
Open
JATAYU000 wants to merge 369 commits into openml:main from JATAYU000:dataset_resource
Commits
d43cf86
implement _sync_api_config
geetu040 3e323ed
update tests with _sync_api_config
geetu040 9195fa6
rename config: timeout -> timeout_seconds
geetu040 5342eec
use timedelta for default ttl value
geetu040 adc0e74
update tests, adds v2/fallback
geetu040 bfb2d3e
add MinIOClient in TestBase
geetu040 707e1f1
publish,tag methods need testing
JATAYU000 cabaecf
fix linting for builder
geetu040 79cf49c
new migration tests
JATAYU000 5c8791a
Merge /1576
JATAYU000 85c1113
fix unbound variables: "code", "message"
geetu040 39bf86a
use requests.Session()
geetu040 7b66677
remove "timeout_seconds" entirely
geetu040 d2224c4
update/refactor tests
geetu040 9608c36
remove unused current_api_version from TestAPIBase
geetu040 f6bc7f7
make TestAPIBase inherit TestBase
geetu040 baa3a38
nits: test classes
geetu040 29c93d1
Review changes, new tests
JATAYU000 7674b3a
Merge bse migration
JATAYU000 ddb0774
Doc strings
JATAYU000 52b93fe
minor fix in _sync_api_config
geetu040 ec9477f
chore: rerun CI
geetu040 cea6188
delete mock, decorator
JATAYU000 3d4e84d
Merge base
JATAYU000 839bd33
delete url in test
JATAYU000 8417349
New test design
JATAYU000 10d134a
remove duplicates in _api/resources/__init__.py
geetu040 935f0f4
implement HTTPClient.download and add tests
geetu040 9514df8
add docstrings
geetu040 09f9ad6
Review changes
JATAYU000 0b52427
Merge base pr
JATAYU000 53bee94
update minio
geetu040 33b4ca0
make delay functions static
geetu040 a6b9a45
rename: retry_raise_e -> exception
geetu040 f924b32
use context-manager for requests.Session
geetu040 541b0f2
remove "assert response is not None"
geetu040 acb173f
verify checksum before caching
geetu040 3e8d1f0
update tests
geetu040 f83bdb5
minor fix in ResourceV1API.untag
geetu040 969c7d8
Merge branch 'main' into dataset_resource
JATAYU000 2a42712
remove cache.ttl
geetu040 001caad
replace config.cache.dir with config.cache_dir
geetu040 fb38a2d
make HTTPClient.cache compulsory
geetu040 03c4ca9
remove unused OpenMLCacheRequiredError
geetu040 8d708fd
implement and use TestAPIBase._create_resource
geetu040 4f75bba
make ResourceAPI.minio compulsory
geetu040 164f66f
Merge branch 'main' into migration
geetu040 c4dae43
rename: use_cache -> enable_cache; reset_cache -> refresh_cache
geetu040 36c20a2
use server config from TestBase
geetu040 ab3c1eb
tests: mock HTTP post calls to prevent race conditions
geetu040 0fc3c74
Merge bse_migration into dataset_resource
JATAYU000 741a66b
rename cache params
JATAYU000 81dff8d
Merge branch 'dataset_resource'
JATAYU000 27ac86f
Minio assertions, other reviews
JATAYU000 2a488ca
Merge branch 'main' into migration
geetu040 cbc7194
Merge base migration
JATAYU000 599c7e1
remove hardcoded server in TestHTTPClient.test_cache
geetu040 2867862
fix docstring in _resolve_default_cache_dir
geetu040 f09f3cd
fix docstring in ResourceAPI
geetu040 5f731ce
remove duplicates in __all__
geetu040 bad7842
remove ttl related code/docs
geetu040 aefdb38
remove delay methods in HTTPClient
geetu040 0f40b02
minor fix in _resolve_default_cache_dir
geetu040 7ac1672
update FallbackProxy
geetu040 6ac1dfe
simplify _backend creation
geetu040 62924c9
Merge branch 'main' into migration
geetu040 27696bb
req changes
satvshr 190face
resolve conflicts
satvshr 95daaa6
remove old config file
satvshr 7841ea8
added OPENML_TEST_SERVER_ADMIN_KEY_ENV_VAR
satvshr cc515aa
bug fixing
satvshr e6a92df
armagh fix
satvshr 1b8c22a
update content_type check
geetu040 fc839a6
Revert "make delay functions static"
geetu040 1c922af
Revert "remove delay methods in HTTPClient"
geetu040 ffa9ce9
Merge branch 'main' into migration
geetu040 a7b2d21
allow api_key=None
geetu040 27fe790
add tests for api_key=None
geetu040 8965112
update cache not found message
geetu040 72ea1a4
update docs for path in HTTPCache
geetu040 a696c49
remove elapsed from cached meta
geetu040 755636d
move self.headers to _HEADERS
geetu040 d07af34
fix indentation in docstrings of _resolve_default_cache_dir
geetu040 2d9c8ec
Update openml/_api/clients/http.py
geetu040 002b989
Merge branch 'main' into migration
geetu040 045d896
move _handle_delete_exception and_get_endpoint_name, legal_resources
geetu040 c437966
set HTTPClient.headers
geetu040 e27470a
remove main_tag
geetu040 d04d956
remove and merge TestAPIBase into TestBase
geetu040 9263f7f
minor change in TestHTTPClient.test_cache
geetu040 79dea29
make HTTPClient.request private
geetu040 f6497c2
Revert "update FallbackProxy"
geetu040 dce7f54
use st_ctime instead of st_ctime for cache refresh test
geetu040 40dd460
Merge branch 'main' into issue1564
geetu040 0fc917c
majore config refactor
geetu040 3d86b18
Merge branch 'pr-1577' into migration
geetu040 aba3d3e
update _config.py
geetu040 d99d54d
update test_openml_cache_dir_env_var
geetu040 dc22e3a
fix mutable SERVERS_REGISTRY
geetu040 7318573
update set_api_version for fallback
geetu040 29ef187
minor fix
geetu040 cf94c89
fixes for test_config
geetu040 298fbda
fixes in conftest urls
geetu040 9870502
update test_http.py
geetu040 33065c2
undo changes with test_openml_cache_dir_env_var
geetu040 76b92bb
fix server mode in test_config.py
geetu040 419edcb
move _HEADERS to confing
geetu040 cb6d937
add fixtures for migration tests
geetu040 8544c8a
update test_http.py with fixtures
geetu040 d4c413b
update test_versions.py
geetu040 fab1a15
update test_versions.py
geetu040 6392be8
Merge base-migration
JATAYU000 276324a
fix error message in HTTPClient.server
geetu040 73f7594
fixes in test_versions.py: use DummyTaskAPI instead of TaskAPI
geetu040 2ee7fa3
add clients in openml._backend
geetu040 4f37607
skip parquet env var
JATAYU000 c74754a
Merge base-migration
JATAYU000 2473208
Updated test,admin fixture
JATAYU000 7afb0e3
code qulity Reviews
JATAYU000 3b96559
Test fixes
JATAYU000 ea80785
remove unnecessary
JATAYU000 83a2e80
Fix mock delete
JATAYU000 9eb6c90
Exception review
JATAYU000 4be5bbd
fixes with openml.config.[server|apikey] leakage
geetu040 9027c01
remove unused fixtures: use_api_[v1|v2]
geetu040 c1efdeb
Merge base-mgration
JATAYU000 dd048d5
mock requests
JATAYU000 98041ed
skip v2 test for now
JATAYU000 e5461a9
add more config tests
geetu040 7d899a9
make SERVERS_REGISTRY private
geetu040 8587414
fix marker: uses_test_server->test_server
geetu040 23a3450
fix UserWarning
geetu040 ac28f82
update fixture: with_server
geetu040 4a66245
req changes
satvshr c762fb4
Merge branch 'issue1564' of https://github.com/satvshr/openml-python …
satvshr 77c21f2
Update openml/_api/clients/http.py
geetu040 eac24fc
Update tests/test_api/test_http.py
geetu040 2ed65fe
update test_get_uses_cached_response
geetu040 f3b07de
test_get_with_api_key
geetu040 29db3f1
use .arff instead of .bin in tests
geetu040 3b4e538
update test_download_creates_file to use md5_checksum
geetu040 8ac886b
update test_download_is_cached_on_disk
geetu040 305f4f0
update APIBackendBuilder
geetu040 b2bf164
Merge branch 'main' into migration
geetu040 e97e6c2
Update openml/_api/clients/http.py
geetu040 c66d73c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] aa54e8e
pre-commit fixes
geetu040 2d452d3
Merge branch 'main' into issue1564
geetu040 c235812
Merge branch 'main' into issue1564
fkiraly 39eb823
Trigger CI
satvshr 50eed37
Merge branch 'main' into migration
geetu040 7a000eb
Merge branch 'main' into issue1564
geetu040 79f6187
Merge branch 'main' into issue1564
geetu040 b1a9e7f
Merge branch 'pr-1577' into migration (merge conflicts)
geetu040 d716ecf
update server methods in config
geetu040 3c29e71
fix api-version leakage in tests
geetu040 b4ff0b2
remove unused migration code
geetu040 93155ee
debug ci: separate cache for each test-case
geetu040 d3cc9a7
update port for localhost
geetu040 a6b82f4
Revert "debug ci: separate cache for each test-case"
geetu040 3419973
rerun CI
geetu040 8de99b7
Merge branch 'main' into migration
geetu040 d0202b0
Merge base migration
JATAYU000 0fa9e3b
Fix tests for new test setup
JATAYU000 7d61107
create enum ServerMode
geetu040 1ecbbba
update config for ServerMode
geetu040 65472ed
update tests for ServerMode
geetu040 9219266
Update status_update
JATAYU000 44b48b5
udpate apikey in _TEST_SERVERS_LOCAL
geetu040 11b19de
skip v2 status_update
JATAYU000 4df12d3
Merge base migration
JATAYU000 77d2af2
skip v2 status_update
JATAYU000 04bc83b
fix: remove duplicate server name in cache path
geetu040 f926092
test: remove check for ":" since windows CI expects it
geetu040 8072e34
adds marker
JATAYU000 47464e9
Merge base migration
JATAYU000 f059e71
switch 1 worker test
JATAYU000 b6d5e31
Merge main
JATAYU000 4ee28f1
reduce test workers
JATAYU000 509b4c3
revert workers, static patch
JATAYU000 6385597
fixes
JATAYU000 5fea9c9
exists check in load
JATAYU000 4b43003
updating cache tests
JATAYU000 f01db35
Revert "updating cache tests"
JATAYU000 e10d776
fixes
JATAYU000 ba7edd8
update lazy_behaviour tests, helper functions on paths
JATAYU000 f003425
file lock
JATAYU000 713356e
Merge main
JATAYU000 92bc246
mock get
JATAYU000 f9dddac
Merge branch 'main' into dataset_resource
JATAYU000 b90e7c4
path updates
JATAYU000 4f3ec74
chore: rerun CI
JATAYU000 98616db
isolate njobs>1
JATAYU000 ed35e69
taggin exception
JATAYU000 4af9cbe
Merge main resolve conflicts
JATAYU000 1c4f946
new tests files, conftest
JATAYU000 8c1c205
debugger
JATAYU000 0d99b8d
conftest update
JATAYU000 c0871f3
add path for id 1
JATAYU000 6791fb6
debug
JATAYU000 4164607
Merge main
JATAYU000 55f13ad
debug
JATAYU000 d6fe96a
debug fixes
JATAYU000 c0b3377
debug fixes
JATAYU000 8d37464
debug fixes
JATAYU000 e79bb91
debug fixes
JATAYU000 db74277
update cache clearing
JATAYU000 fac0240
chore: rerun CI
JATAYU000 95c68c6
chore: rerun CI
JATAYU000 2b7df47
Force refresh cache
JATAYU000 b5836b9
inc tries for fork, debug
JATAYU000 5e34368
update HTTPClient
geetu040 17fc002
replace _http.download with _http.get
geetu040 55b3f11
undo change in HTTPClient.download signature
geetu040 8fe5941
HTTPClient.cache_path_from_response
geetu040 37526bb
delete previously added files for tests/files
geetu040 3996bdd
update cache files in tests/files
geetu040 b3e9ab1
update cache files in tests/files
geetu040 ac3b903
update HTTPClient for zip files
geetu040 94ed2a6
update: cache_path_from_response -> cache_path_from_url
geetu040 df8b4b8
remove all meta.json
geetu040 3778204
update conftest.py
geetu040 b6c4b91
update test_list_all_few_results_available
geetu040 2edd5d0
update test_lazy_loading_metadata
geetu040 8d88959
update test_lazy_loading_metadata
geetu040 5d104c1
udpate test_get_dataset_force_refresh_cache
geetu040 9dffaae
remove test_get_dataset_force_refresh_cache_clean_start
geetu040 fe7cf96
Revert "remove test_get_dataset_force_refresh_cache_clean_start"
geetu040 7ae536a
add http_client in TestBase
geetu040 4fc0616
update test_lazy_loading_metadata
geetu040 c1c5544
update _get_body_filename_from_path
geetu040 140830e
update test_lazy_loading_metadata
geetu040 af8810d
update test_get_dataset_force_refresh_cache
geetu040 959d56b
remove test_deletion_of_cache_dir_faulty_download
geetu040 4719013
Revert class labels
JATAYU000 b33a895
Update improper paths in tests
JATAYU000 e2eddc6
test debuger
JATAYU000 9f73cca
fixup! test debuger
JATAYU000 28c9946
log file paths
JATAYU000 b499d7c
remove logger
JATAYU000 b821c82
minio path method in config
JATAYU000 d434d45
update minio client, old test file
JATAYU000 2c68a32
Review fixes
JATAYU000 929288f
minio parent
JATAYU000 e130739
remove minio file from static
JATAYU000 56ec10a
remove pq test file
JATAYU000 18d3439
reviews updated
JATAYU000 74bf348
update to minio path util
JATAYU000 0de1eac
remove isolation on joblib test
JATAYU000 6f11d4e
added version for failed test
JATAYU000
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| from openml._api.runtime.core import APIContext | ||
| | ||
| | ||
| def set_api_version(version: str, *, strict: bool = False) -> None: | ||
|     api_context.set_version(version=version, strict=strict) | ||
| | ||
| | ||
| api_context = APIContext() | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| from __future__ import annotations | ||
| | ||
| from dataclasses import dataclass | ||
| from typing import Literal | ||
| | ||
| DelayMethod = Literal["human", "robot"] | ||
| | ||
| | ||
| @dataclass | ||
| class APIConfig: | ||
|     server: str | ||
|     base_url: str | ||
|     key: str | ||
|     timeout: int = 10  # seconds | ||
| | ||
| | ||
| @dataclass | ||
| class APISettings: | ||
|     v1: APIConfig | ||
|     v2: APIConfig | ||
| | ||
| | ||
| @dataclass | ||
| class ConnectionConfig: | ||
|     retries: int = 3 | ||
|     delay_method: DelayMethod = "human" | ||
|     delay_time: int = 1  # seconds | ||
| | ||
|     def __post_init__(self) -> None: | ||
|         if self.delay_method not in ("human", "robot"): | ||
|             raise ValueError(f"delay_method must be 'human' or 'robot', got {self.delay_method}") | ||
| | ||
| | ||
| @dataclass | ||
| class CacheConfig: | ||
|     dir: str = "~/.openml/cache" | ||
|     ttl: int = 60 * 60 * 24 * 7  # one week | ||
| | ||
| | ||
| @dataclass | ||
| class Settings: | ||
|     api: APISettings | ||
|     connection: ConnectionConfig | ||
|     cache: CacheConfig | ||
| | ||
| | ||
| settings = Settings( | ||
|     api=APISettings( | ||
|         v1=APIConfig( | ||
|             server="https://www.openml.org/", | ||
|             base_url="api/v1/xml/", | ||
|             key="...", | ||
|         ), | ||
|         v2=APIConfig( | ||
|             server="http://127.0.0.1:8001/", | ||
|             base_url="", | ||
|             key="...", | ||
|         ), | ||
|     ), | ||
|     connection=ConnectionConfig(), | ||
|     cache=CacheConfig(), | ||
| ) | ||
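
For readers of this diff: the `server` and `base_url` fields are combined with a resource path through the nested `urljoin(self.server, urljoin(self.base_url, path))` call in `HTTPClient.request`. The sketch below isolates that call to show how it resolves. The helper name `build_url` and the paths `data/61` and `datasets/61` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin


def build_url(server: str, base_url: str, path: str) -> str:
    """Mirror the nested urljoin call from HTTPClient.request (illustrative helper)."""
    return urljoin(server, urljoin(base_url, path))


# Relative path: appended below base_url.
v1_url = build_url("https://www.openml.org/", "api/v1/xml/", "data/61")

# Empty base_url (the v2 default in this diff): path is joined directly to the server.
v2_url = build_url("http://127.0.0.1:8001/", "", "datasets/61")

# A leading slash makes the path root-relative, so base_url is dropped.
rooted_url = build_url("https://www.openml.org/", "api/v1/xml/", "/data/61")

# An absolute URL passes through unchanged.
absolute_url = build_url(
    "https://www.openml.org/", "api/v1/xml/", "https://example.org/file.arff"
)
```

One consequence worth keeping in mind during review: a path that starts with `/` silently bypasses `base_url`, so resource implementations would need to pass relative paths for the v1 prefix to apply.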
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| from openml._api.http.client import HTTPClient, MinIOClient | ||
| | ||
| __all__ = ["HTTPClient", "MinIOClient"] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,302 @@ | ||
| from __future__ import annotations | ||
| | ||
| import contextlib | ||
| import shutil | ||
| import urllib | ||
| import urllib.parse | ||
| import zipfile | ||
| from collections.abc import Callable | ||
| from pathlib import Path | ||
| from typing import TYPE_CHECKING, Any | ||
| from urllib.parse import urlencode, urljoin, urlparse | ||
| | ||
| import minio | ||
| import requests | ||
| from requests import Response | ||
| from urllib3 import ProxyManager | ||
| | ||
| from openml.__version__ import __version__ | ||
| from openml._api.config import settings | ||
| | ||
| if TYPE_CHECKING: | ||
|     from openml._api.config import APIConfig | ||
| | ||
| import openml.config | ||
| from openml.utils import ProgressBar | ||
| | ||
| | ||
| class CacheMixin: | ||
|     @property | ||
|     def dir(self) -> str: | ||
|         return settings.cache.dir | ||
| | ||
|     @property | ||
|     def ttl(self) -> int: | ||
|         return settings.cache.ttl | ||
| | ||
|     def _get_cache_dir(self, url: str, params: dict[str, Any]) -> Path: | ||
|         parsed_url = urlparse(url) | ||
|         netloc_parts = parsed_url.netloc.split(".")[::-1]  # reverse domain | ||
|         path_parts = parsed_url.path.strip("/").split("/") | ||
| | ||
|         # remove api_key and serialize params if any | ||
|         filtered_params = {k: v for k, v in params.items() if k != "api_key"} | ||
|         params_part = [urlencode(filtered_params)] if filtered_params else [] | ||
| | ||
|         return Path(self.dir).joinpath(*netloc_parts, *path_parts, *params_part) | ||
| | ||
|     def _get_cache_response(self, cache_dir: Path) -> Response:  # noqa: ARG002 | ||
|         return Response() | ||
| | ||
|     def _set_cache_response(self, cache_dir: Path, response: Response) -> None:  # noqa: ARG002 | ||
|         return None | ||
| | ||
| | ||
| class HTTPClient(CacheMixin): | ||
|     def __init__(self, config: APIConfig) -> None: | ||
|         self.config = config | ||
|         self.headers: dict[str, str] = {"user-agent": f"openml-python/{__version__}"} | ||
| | ||
|     @property | ||
|     def server(self) -> str: | ||
|         return self.config.server | ||
| | ||
|     @property | ||
|     def base_url(self) -> str: | ||
|         return self.config.base_url | ||
| | ||
|     @property | ||
|     def key(self) -> str: | ||
|         return self.config.key | ||
| | ||
|     @property | ||
|     def timeout(self) -> int: | ||
|         return self.config.timeout | ||
| | ||
|     def request( | ||
|         self, | ||
|         method: str, | ||
|         path: str, | ||
|         *, | ||
|         use_cache: bool = False, | ||
|         use_api_key: bool = False, | ||
|         **request_kwargs: Any, | ||
|     ) -> Response: | ||
|         url = urljoin(self.server, urljoin(self.base_url, path)) | ||
| | ||
|         params = request_kwargs.pop("params", {}) | ||
|         params = params.copy() | ||
|         if use_api_key: | ||
|             params["api_key"] = self.key | ||
| | ||
|         headers = request_kwargs.pop("headers", {}) | ||
|         headers = headers.copy() | ||
|         headers.update(self.headers) | ||
| | ||
|         timeout = request_kwargs.pop("timeout", self.timeout) | ||
|         cache_dir = self._get_cache_dir(url, params) | ||
| | ||
|         if use_cache: | ||
|             try: | ||
|                 return self._get_cache_response(cache_dir) | ||
|             # TODO: handle ttl expired error | ||
|             except Exception: | ||
|                 raise | ||
| | ||
|         response = requests.request( | ||
|             method=method, | ||
|             url=url, | ||
|             params=params, | ||
|             headers=headers, | ||
|             timeout=timeout, | ||
|             **request_kwargs, | ||
|         ) | ||
| | ||
|         if use_cache: | ||
|             self._set_cache_response(cache_dir, response) | ||
| | ||
|         return response | ||
| | ||
|     def get( | ||
|         self, | ||
|         path: str, | ||
|         *, | ||
|         use_cache: bool = False, | ||
|         use_api_key: bool = False, | ||
|         **request_kwargs: Any, | ||
|     ) -> Response: | ||
|         # TODO: remove override when cache is implemented | ||
|         use_cache = False | ||
|         return self.request( | ||
|             method="GET", | ||
|             path=path, | ||
|             use_cache=use_cache, | ||
|             use_api_key=use_api_key, | ||
|             **request_kwargs, | ||
|         ) | ||
| | ||
|     def post( | ||
|         self, | ||
|         path: str, | ||
|         **request_kwargs: Any, | ||
|     ) -> Response: | ||
|         return self.request( | ||
|             method="POST", | ||
|             path=path, | ||
|             use_cache=False, | ||
|             use_api_key=True, | ||
|             **request_kwargs, | ||
|         ) | ||
| | ||
|     def delete( | ||
|         self, | ||
|         path: str, | ||
|         **request_kwargs: Any, | ||
|     ) -> Response: | ||
|         return self.request( | ||
|             method="DELETE", | ||
|             path=path, | ||
|             use_cache=False, | ||
|             use_api_key=True, | ||
|             **request_kwargs, | ||
|         ) | ||
| | ||
|     def download( | ||
|         self, | ||
|         url: str, | ||
|         handler: Callable[[Response, Path, str], Path] | None = None, | ||
|         encoding: str = "utf-8", | ||
|     ) -> Path: | ||
|         response = self.get(url) | ||
|         dir_path = self._get_cache_dir(url, {}) | ||
|         dir_path = dir_path.expanduser() | ||
|         if handler is not None: | ||
|             return handler(response, dir_path, encoding) | ||
| | ||
|         return self._text_handler(response, dir_path, encoding) | ||
| | ||
|     def _text_handler(self, response: Response, path: Path, encoding: str) -> Path: | ||
|         if path.is_dir(): | ||
|             path = path / "response.txt" | ||
|         path.parent.mkdir(parents=True, exist_ok=True) | ||
|         with path.open("w", encoding=encoding) as f: | ||
|             f.write(response.text) | ||
|         return path | ||
| | ||
| | ||
| class MinIOClient(CacheMixin): | ||
|     def __init__(self) -> None: | ||
|         self.headers: dict[str, str] = {"user-agent": f"openml-python/{__version__}"} | ||
| | ||
|     def download_minio_file( | ||
|         self, | ||
|         source: str, | ||
|         destination: str | Path | None = None, | ||
|         exists_ok: bool = True,  # noqa: FBT002 | ||
|         proxy: str | None = "auto", | ||
|     ) -> Path: | ||
|         """Download file ``source`` from a MinIO Bucket and store it at ``destination``. | ||
| | ||
|         Parameters | ||
|         ---------- | ||
|         source : str | ||
|             URL to a file in a MinIO bucket. | ||
|         destination : str | Path | ||
|             Path to store the file to, if a directory is provided the original filename is used. | ||
|         exists_ok : bool, optional (default=True) | ||
|             If False, raise FileExists if a file already exists in ``destination``. | ||
|         proxy: str, optional (default = "auto") | ||
|             The proxy server to use. By default it's "auto" which uses ``requests`` to | ||
|             automatically find the proxy to use. Pass None or the environment variable | ||
|             ``no_proxy="*"`` to disable proxies. | ||
|         """ | ||
|         destination = self._get_cache_dir(source, {}) if destination is None else Path(destination) | ||
|         parsed_url = urllib.parse.urlparse(source) | ||
| | ||
|         # expect path format: /BUCKET/path/to/file.ext | ||
|         bucket, object_name = parsed_url.path[1:].split("/", maxsplit=1) | ||
|         if destination.is_dir(): | ||
|             destination = Path(destination, object_name) | ||
|         if destination.is_file() and not exists_ok: | ||
|             raise FileExistsError(f"File already exists in {destination}.") | ||
| | ||
|         destination = destination.expanduser() | ||
|         destination.parent.mkdir(parents=True, exist_ok=True) | ||
| | ||
|         if proxy == "auto": | ||
|             resolved_proxies = requests.utils.get_environ_proxies(parsed_url.geturl()) | ||
|             proxy = requests.utils.select_proxy(parsed_url.geturl(), resolved_proxies)  # type: ignore | ||
| | ||
|         proxy_client = ProxyManager(proxy) if proxy else None | ||
| | ||
|         client = minio.Minio(endpoint=parsed_url.netloc, secure=False, http_client=proxy_client) | ||
|         try: | ||
|             client.fget_object( | ||
|                 bucket_name=bucket, | ||
|                 object_name=object_name, | ||
|                 file_path=str(destination), | ||
|                 progress=ProgressBar() if openml.config.show_progress else None, | ||
|                 request_headers=self.headers, | ||
|             ) | ||
|             if destination.is_file() and destination.suffix == ".zip": | ||
|                 with zipfile.ZipFile(destination, "r") as zip_ref: | ||
|                     zip_ref.extractall(destination.parent) | ||
| | ||
|         except minio.error.S3Error as e: | ||
|             if e.message is not None and e.message.startswith("Object does not exist"): | ||
|                 raise FileNotFoundError(f"Object at '{source}' does not exist.") from e | ||
|             # e.g. permission error, or a bucket does not exist (which is also interpreted as a | ||
|             # permission error on minio level). | ||
|             raise FileNotFoundError("Bucket does not exist or is private.") from e | ||
| | ||
|         return destination | ||
| | ||
|     def download_minio_bucket(self, source: str, destination: str | Path | None = None) -> None: | ||
|         """Download file ``source`` from a MinIO Bucket and store it at ``destination``. | ||
| | ||
|         Does not redownload files which already exist. | ||
| | ||
|         Parameters | ||
|         ---------- | ||
|         source : str | ||
|             URL to a MinIO bucket. | ||
|         destination : str | Path | ||
|             Path to a directory to store the bucket content in. | ||
|         """ | ||
|         destination = self._get_cache_dir(source, {}) if destination is None else Path(destination) | ||
|         parsed_url = urllib.parse.urlparse(source) | ||
| | ||
|         # expect path format: /BUCKET/path/to/file.ext | ||
|         _, bucket, *prefixes, _file = parsed_url.path.split("/") | ||
|         prefix = "/".join(prefixes) | ||
| | ||
|         client = minio.Minio(endpoint=parsed_url.netloc, secure=False) | ||
| | ||
|         for file_object in client.list_objects(bucket, prefix=prefix, recursive=True): | ||
|             if file_object.object_name is None: | ||
|                 raise ValueError(f"Object name is None for object {file_object!r}") | ||
|             if file_object.etag is None: | ||
|                 raise ValueError(f"Object etag is None for object {file_object!r}") | ||
| | ||
|             marker = destination / file_object.etag | ||
|             if marker.exists(): | ||
|                 continue | ||
| | ||
|             file_destination = destination / file_object.object_name.rsplit("/", 1)[1] | ||
|             if (file_destination.parent / file_destination.stem).exists(): | ||
|                 # Marker is missing but archive exists means the server archive changed | ||
|                 # force a refresh | ||
|                 shutil.rmtree(file_destination.parent / file_destination.stem) | ||
| | ||
|             with contextlib.suppress(FileExistsError): | ||
|                 self.download_minio_file( | ||
|                     source=source.rsplit("/", 1)[0] | ||
|                     + "/" | ||
|                     + file_object.object_name.rsplit("/", 1)[1], | ||
|                     destination=file_destination, | ||
|                     exists_ok=False, | ||
|                 ) | ||
| | ||
|             if file_destination.is_file() and file_destination.suffix == ".zip": | ||
|                 file_destination.unlink() | ||
|                 marker.touch() | ||
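
To make the cache layout in `CacheMixin._get_cache_dir` easier to review, here is a standalone sketch of the same mapping from URL and query parameters to a directory. It follows the diff's logic line by line, but the function name `cache_dir_for`, the `root` argument, and the example URL are assumptions for illustration only.

```python
from pathlib import Path
from typing import Any
from urllib.parse import urlencode, urlparse


def cache_dir_for(url: str, params: dict[str, Any], root: str) -> Path:
    """Map a request URL and its params to a cache directory (illustrative helper)."""
    parsed = urlparse(url)
    # Reverse the host so directories nest from the top-level domain inward.
    netloc_parts = parsed.netloc.split(".")[::-1]
    path_parts = parsed.path.strip("/").split("/")
    # The API key must never influence (or leak into) the cache path.
    filtered = {k: v for k, v in params.items() if k != "api_key"}
    params_part = [urlencode(filtered)] if filtered else []
    return Path(root).joinpath(*netloc_parts, *path_parts, *params_part)


with_params = cache_dir_for(
    "https://www.openml.org/api/v1/xml/data/61",
    {"api_key": "secret", "limit": 5},
    root="cache",
)
without_params = cache_dir_for("https://www.openml.org/api/v1/xml/data/61", {}, root="cache")
```

Two requests that differ only in `api_key` share a cache entry under this scheme. Note also that the host is split on `.` regardless of its form, so an address such as `127.0.0.1:8001` is reversed octet by octet and the port stays attached to the last octet.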
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| from openml._api.resources.datasets import DatasetsV1, DatasetsV2 | ||
| from openml._api.resources.tasks import TasksV1, TasksV2 | ||
| | ||
| __all__ = ["DatasetsV1", "DatasetsV2", "TasksV1", "TasksV2"] | ||
`_file` should be renamed to `_` given it is never used; surprised ruff does not call it out.
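
A minimal sketch of the suggested rename, isolating the unpacking from `download_minio_bucket`. The helper name `split_bucket_prefix` and the example URL are hypothetical. Ruff most likely stays quiet because its default dummy-variable pattern treats underscore-prefixed names such as `_file` as intentionally unused, though that is an assumption about the project's lint configuration.

```python
from urllib.parse import urlparse


def split_bucket_prefix(source: str) -> tuple[str, str]:
    """Extract the bucket and key prefix from a MinIO object URL (illustrative helper)."""
    parsed = urlparse(source)
    # Path format: /BUCKET/path/to/file.ext
    # The leading empty segment and the trailing file name are both discarded,
    # so both can be bound to the conventional throwaway name `_`.
    _, bucket, *prefixes, _ = parsed.path.split("/")
    return bucket, "/".join(prefixes)


bucket, prefix = split_bucket_prefix("http://example.org:9000/datasets/0000/0061/dataset_61.pq")
```

Binding `_` twice in one unpacking target is valid Python; the second assignment simply overwrites the first.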