Creating new HSDS job store, no jobs added #3970

Closed
edraizen opened this issue Dec 21, 2021 · 4 comments

Comments

@edraizen

edraizen commented Dec 21, 2021

Hello,

I am moving away from AWS in favor of a local MinIO instance on our university cluster. However, the AWS S3 job store requires SimpleDB (SDB), which I don't believe MinIO provides. I have been using HSDS in my project, so I thought it would be interesting to try building a job store on top of it, with MinIO handling the overlarge file storage. I think I might be missing something, because no jobs get added and the stats-and-logging output is written constantly. The S3/MinIO part works with some modification; the HSDS part is the one with errors.

My new branch with the HSDSS3JobStore: https://github.com/edraizen/toil/tree/hsds.

If anyone has any suggestions on how to get this working or other alternatives to SDB, I would really appreciate it!

Thanks,
Eli

[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] group.getitem(files)
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] __get_link_json(files)
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] getObjByUuid(d-2e0c2ba1-ba058526-5da5-8a4220-725831)
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] use basic auth with username: [email protected]
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] GET: /datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831 [/home/ed4bu/cath-paper-test-0-jobstore.h5] bucket: None
[2021-12-21T04:44:30+0000] [Thread-14 ] [I] [root] GET: http://hsds.default.svc.cluster.local:5101/datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831 [/home/ed4bu/cath-paper-test-0-jobstore.h5]
[2021-12-21T04:44:30+0000] [Thread-14 ] [I] [root] status: 200
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] rsp_json: {'id': 'd-2e0c2ba1-ba058526-5da5-8a4220-725831', 'root': 'g-2e0c2ba1-ba058526-a684-a32932-8d0dae', 'shape': {'class': 'H5S_SIMPLE', 'dims': [10], 'maxdims': [0]}, 'type': {'class': 'H5T_COMPOUND', 'fields': [{'name': 'file_id', 'type': {'class': 'H5T_STRING', 'length': 'H5T_VARIABLE', 'charSet': 'H5T_CSET_UTF8', 'strPad': 'H5T_STR_NULLTERM'}}, {'name': 'ownerID', 'type': {'class': 'H5T_STRING', 'length': 'H5T_VARIABLE', 'charSet': 'H5T_CSET_UTF8', 'strPad': 'H5T_STR_NULLTERM'}}, {'name': 'encrypted', 'type': {'class': 'H5T_INTEGER', 'base': 'H5T_STD_I8LE'}}, {'name': 'version', 'type': {'class': 'H5T_STRING', 'length': 'H5T_VARIABLE', 'charSet': 'H5T_CSET_UTF8', 'strPad': 'H5T_STR_NULLTERM'}}, {'name': 'checksum', 'type': {'class': 'H5T_STRING', 'length': 'H5T_VARIABLE', 'charSet': 'H5T_CSET_UTF8', 'strPad': 'H5T_STR_NULLTERM'}}, {'name': 'numChunks', 'type': {'class': 'H5T_INTEGER', 'base': 'H5T_STD_I32LE'}}, {'name': 'chunkName', 'type': {'class': 'H5T_STRING', 'length': 'H5T_VARIABLE', 'charSet': 'H5T_CSET_UTF8', 'strPad': 'H5T_STR_NULLTERM'}}, {'name': 'chunkValue', 'type': {'class': 'H5T_STRING', 'length': 'H5T_VARIABLE', 'charSet': 'H5T_CSET_UTF8', 'strPad': 'H5T_STR_NULLTERM'}}]}, 'creationProperties': {'filters': [{'class': 'H5Z_FILTER_DEFLATE', 'id': 1, 'level': 9, 'name': 'gzip'}]}, 'layout': {'class': 'H5D_CHUNKED', 'dims': [8192]}, 'attributeCount': 0, 'created': 1640061185, 'lastModified': 1640061185, 'domain': '/home/ed4bu/cath-paper-test-0-jobstore.h5', 'hrefs': [{'rel': 'self', 'href': 'http://hsds.hdf.test/datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}, {'rel': 'root', 'href': 'http://hsds.hdf.test/groups/g-2e0c2ba1-ba058526-a684-a32932-8d0dae?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}, {'rel': 'home', 'href': 'http://hsds.hdf.test/?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}, {'rel': 'attributes', 'href': 'http://hsds.hdf.test/datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831/attributes?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}, {'rel': 'data', 'href': 'http://hsds.hdf.test/datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831/value?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}]}
[2021-12-21T04:44:30+0000] [Thread-14 ] [I] [root] req - cursor: 0 page_size: 10
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] query param: [0:10]
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] params: {'query': "ownerID=='bfcf5286-4bc7-41ef-a85d-9ab415b69d53'", 'select': '[0:10]'}
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] use basic auth with username: [email protected]
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] GET: /datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831/value [/home/ed4bu/cath-paper-test-0-jobstore.h5] bucket: None
[2021-12-21T04:44:30+0000] [Thread-14 ] [I] [root] GET: http://hsds.default.svc.cluster.local:5101/datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831/value [/home/ed4bu/cath-paper-test-0-jobstore.h5]
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] GET params query:ownerID=='bfcf5286-4bc7-41ef-a85d-9ab415b69d53'
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] GET params select:[0:10]
[2021-12-21T04:44:30+0000] [Thread-14 ] [I] [root] status: 200
[2021-12-21T04:44:30+0000] [Thread-14 ] [D] [root] rsp_json: {'index': [], 'value': [], 'hrefs': [{'rel': 'self', 'href': 'http://hsds.hdf.test/datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831/value?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}, {'rel': 'root', 'href': 'http://hsds.hdf.test/groups/g-2e0c2ba1-ba058526-a684-a32932-8d0dae?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}, {'rel': 'home', 'href': 'http://hsds.hdf.test/?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}, {'rel': 'owner', 'href': 'http://hsds.hdf.test/datasets/d-2e0c2ba1-ba058526-5da5-8a4220-725831?domain=/home/ed4bu/cath-paper-test-0-jobstore.h5'}]}
[2021-12-21T04:44:30+0000] [Thread-14 ] [I] [root] got 0 rows
[2021-12-21T04:44:30+0000] [Thread-14 ] [I] [root] completed iteration, returning: 0 rows
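
For reference, the empty query at the end corresponds to this HSDS REST call. This is a minimal sketch using requests; the endpoint, domain, dataset id, query, and username are taken from the log above, and the password is a placeholder:

```python
import requests

# Reproduce the logged dataset query against the HSDS REST API.
# Endpoint, domain, dataset id, and query are copied from the log output above;
# the password is a placeholder.
endpoint = "http://hsds.default.svc.cluster.local:5101"
domain = "/home/ed4bu/cath-paper-test-0-jobstore.h5"
dataset = "d-2e0c2ba1-ba058526-5da5-8a4220-725831"

resp = requests.get(
    f"{endpoint}/datasets/{dataset}/value",
    params={
        "domain": domain,  # HSDS takes the domain as a request parameter
        "query": "ownerID=='bfcf5286-4bc7-41ef-a85d-9ab415b69d53'",
        "select": "[0:10]",
    },
    auth=("[email protected]", "REDACTED"),
)
resp.raise_for_status()
print(resp.json()["value"])  # [] in the run above, i.e. no matching rows
```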

Full logs: https://gist.github.com/edraizen/18dc8d00441a2b2324cdeaa98f33b9bf


@adamnovak
Member

adamnovak commented Dec 21, 2021

Thanks for working on this!

@DailyDreaming has a PR #3569 that is trying to rip out SimpleDB altogether, which might be useful to look at, but it hasn't been touched in a while.

I noticed you did create an HSDSJobStoreTest; do all the tests on it fail, or only some of them? That could give more clues about which pieces are broken than just trying to run a workflow would.

If all you want to do is avoid SimpleDB, and MinIO's S3 implementation is strongly consistent, you can drop the table layer the S3 job store uses, be more like the Google job store, and just store everything directly in the bucket. We added SimpleDB on top of S3 back when Amazon S3 wasn't strongly consistent, to enforce strong consistency and to speed things up a bit; now that Amazon S3 is strongly consistent, we don't really need it anymore. If you're implementing something new, it might be better not to put a database layer over the bucket layer at all.
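
For illustration, here is a minimal sketch of the bucket-only approach against MinIO. The endpoint, credentials, bucket name, and key layout are assumptions made for the sketch, not Toil's actual S3 job store layout:

```python
import json
import boto3

# Store each job record as a JSON object directly in the bucket and rely on
# the store's strong read-after-write consistency instead of a database layer.
# Endpoint, credentials, bucket, and key layout below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

def save_job(bucket: str, job_id: str, job: dict) -> None:
    s3.put_object(Bucket=bucket, Key=f"jobs/{job_id}",
                  Body=json.dumps(job).encode())

def load_job(bucket: str, job_id: str) -> dict:
    obj = s3.get_object(Bucket=bucket, Key=f"jobs/{job_id}")
    return json.loads(obj["Body"].read())
```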

As for the logs you sent, it looks like no jobs are ever being read out. (This makes sense, because the leader has a cache and only reads back jobs when it thinks they've been touched elsewhere.) They are logged as being appended by the batch save code:

[2021-12-21T04:33:21+0000] [MainThread] [D] [toil.jobStores.hsdsJobStore] jobs to add [('800f85f9-ae03-4450-ab3a-5349b389b5ec', '', 1, '000', 'Q0JaaDkxQVkmU1kpe13vAADt/+By3z8ARGP/4A05jIK///+iYASOADABamrLDU0mQnplPVDzVPUaG1DTI09TaQDIZMgNtKHqEVPyExTxJ6mp6ajTEZMg2iMBNGAg9I9CMNTEmiYp6ZE9TxNNQGTJhGgYQ0GR5Joeocjc4unrsMYC0jYdOr5RaM2Hzz0jFIhPkuKOvlPVsqOqpmVBCXRAoLwJpEYxPMqB6GBflU7W0xPwuhZC3hMEHqAHrG2xnFPXrSV8Xo11oAsSkLEktxkZulNXZ3PWvgPO8q37C9jE6Kc+Oix+Ckg7kBpCASokDPmj330cUTSRiOFKhgRuKbk1zLLI2uIdb7BKWwEL0mZGGI+VkDkAfQOwC2dwooYPTinQdsk1mUrrWTOyrE/xBBCsYGnhkYSpod0k39NGN0zg0jXh7yjVvJjZHEcammOZUP6Jti21mAtCJBUKRvIlECdmVPwWrr5kIupID+L85DxSTRMZrjPORBRCpWUNmJKRckhEEDdG4s7ngZAdqS8ghR3Las5JDPSU/0GcGZ25skqpEx97FoF7m0PGec1ColboTBERAUmalflb8Us+DPQsfv8iUPV4udpA4jRrD3iLgZb/i7kinChIFL2u94A=')]

But jobs never seem to be read or loaded after that. There is a message that it loaded jobs to iterate over and got [], but that comes earlier. So I don't actually see the job store doing anything wrong.

If you look at the leader messages, the last thing it says is:

[2021-12-21T04:33:24+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run

This is being swamped by constant debug-level logging of every HTTP request made for the stats-and-logging monitor, which spits out job logs as soon as they are saved. You might want to look into quieting that down somehow; we have machinery in Toil already to set higher log levels on internal Boto loggers, and maybe something like that is needed for the HSDS library logger.
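
As a sketch of what that could look like (the library logger names here are assumptions; the pasted output suggests the HSDS client may even be logging through the root logger directly):

```python
import logging

# Quiet chatty third-party loggers, analogous to what Toil does for Boto.
# "h5pyd" is an assumed logger name for the HSDS client; "requests" and
# "urllib3" cover the HTTP layer underneath it.
for name in ("h5pyd", "requests", "urllib3"):
    logging.getLogger(name).setLevel(logging.WARNING)
```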

But I think the real reason that there's no apparent forward progress is that the job went to the batch system, started running, and never stopped. What was happening to the Kubernetes Job object default-toil-69a1cf6a-9ecf-4c5b-97f1-cd82ce0deee5-0 and its associated pod while you were running this? Did kubectl logs <pod> show any complaints about not being able to talk to the job store from the worker? Or is it just a job that takes a while to run?

@edraizen
Author

edraizen commented Dec 29, 2021

Thanks for the suggestions! I finished implementing HSDSJobStoreTest and made sure it passes all of the tests.

Using just MinIO would probably be the way to go, but I'll work on this for now.

The other Kubernetes Job and its associated pod are now giving an error:

DEBUG:toil.fileStores.cachingFileStore:Starting job ('helloWorld' 9fc7c5e7-37cd-4806-b36e-fc95a4f42054 v1) with ID (9fc7c5e7-37cd-4806-b36e-fc95a4f42054).
ERROR:root:Error talking to caching database: database is locked
WARNING:toil.lib.retry:Error in <function CachingFileStore._staticWrite at 0x7f03fdb645e0>: database is locked. Retrying after 1 s...
ERROR:root:Error talking to caching database: database is locked

Full worker logs here: https://gist.github.com/edraizen/84e45cc72fd004cb1b12f56dad9e0bf4

Would this have something to do with my Kubernetes setup? Here is my deployment file in case that helps: https://gist.github.com/edraizen/72b01873a41b497729c1b9814296590e

EDIT: It works correctly using the local batch system, so it is an issue with the Kubernetes setup.

@adamnovak
Member

The workaround here would be --disableCaching.

Is /media/smb-rivanna/ed4bu/UrfoldServer/urfold_runs/cath_pipeline/h5_restructure/k8/ on an unusual filesystem (like SMB) where SQLite database locking might behave unusually? Maybe you need the nobrl mount option?

TOIL_KUBERNETES_HOST_PATH really ought to be a directory local to the individual host, where the host has its local scratch space. It's meant for sharing caches between pods on a host, and setting it to a directory that's shared over the network across multiple hosts isn't ideal: all the local scratch files would then be going to and from the storage server all the time.

@adamnovak
Member

It sounds like the HSDS job store works now, so I'm going to close this.

@edraizen I'm not sure we could take this into upstream Toil if you PR'd it, because we don't have the relevant setup and that would make it hard to put under CI. But Toil does have some plugin-registering support for batch systems that we could probably also make work for job stores, so if you want this working with upstream Toil we could build that out and help you make it a plugin.
