Upload to Datastore errors when DOWNLOAD_PREVIEW_ONLY=True #101

Open
twdbben opened this issue Jun 26, 2023 · 2 comments


twdbben commented Jun 26, 2023

Describe the bug

When using datapusher-plus-docker to run datapusher-plus, with the following config parameters set:

PREVIEW_ROWS=1000
ADD_SUMMARY_STATS_RESOURCE=True
SUMMARY_STATS_WITH_PREVIEW=True
DOWNLOAD_PREVIEW_ONLY=True

There appears to be a problem with DOWNLOAD_PREVIEW_ONLY=True: when I push a resource to DP+ with it enabled, the job errors out. Setting DOWNLOAD_PREVIEW_ONLY=False fixes the errors I'm seeing.

These are my test resource files that are failing:

TCEQ-TEST.xlsx
TCEQ-TEST.csv

When I push the attached XLSX file, I get this error:

datapusher-plus  | --- Logging error ---
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 630, in push_to_datastore
datapusher-plus  |     qsv_excel = subprocess.run(
datapusher-plus  |   File "/usr/lib/python3.10/subprocess.py", line 524, in run
datapusher-plus  |     raise CalledProcessError(retcode, process.args,
datapusher-plus  | subprocess.CalledProcessError: Command '['/usr/local/bin/qsvdp', 'excel', '/tmp/tmp8s4qgo7c.XLSX', '--sheet', '0', '--trim', '--output', '/tmp/tmp7ns3tj6h.csv']' returned non-zero exit status 1.
datapusher-plus  |
datapusher-plus  | During handling of the above exception, another exception occurred:
datapusher-plus  |
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/usr/lib/python3.10/logging/handlers.py", line 1057, in emit
datapusher-plus  |     smtp = smtplib.SMTP(self.mailhost, port, timeout=self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 255, in __init__
datapusher-plus  |     (code, msg) = self.connect(host, port)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 341, in connect
datapusher-plus  |     self.sock = self._get_socket(host, port, self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 312, in _get_socket
datapusher-plus  |     return socket.create_connection((host, port), timeout,
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 845, in create_connection
datapusher-plus  |     raise err
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 833, in create_connection
datapusher-plus  |     sock.connect(sa)
datapusher-plus  | ConnectionRefusedError: [Errno 111] Connection refused
datapusher-plus  | Call stack:
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
datapusher-plus  |     self._bootstrap_inner()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
datapusher-plus  |     self.run()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 953, in run
datapusher-plus  |     self._target(*self._args, **self._kwargs)
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
datapusher-plus  |     work_item.run()
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
datapusher-plus  |     result = self.fn(*self.args, **self.kwargs)
datapusher-plus  |   File "/usr/lib/ckan/dpplus_venv/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
datapusher-plus  |     retval = job.func(*job.args, **job.kwargs)
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 646, in push_to_datastore
datapusher-plus  |     logger.error(
datapusher-plus  | Message: "Upload aborted. Cannot export spreadsheet(?) to CSV: Command '['/usr/local/bin/qsvdp', 'excel', '/tmp/tmp8s4qgo7c.XLSX', '--sheet', '0', '--trim', '--output', '/tmp/tmp7ns3tj6h.csv']' returned non-zero exit status 1."
datapusher-plus  | Arguments: ()
datapusher-plus  | 2023-06-26 16:53:33,176 WARNING Is the file encrypted or is not a spreadsheet?
datapusher-plus  | FILE ATTRIBUTES: /tmp/tmp8s4qgo7c.XLSX: Microsoft Excel 2007+
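
I don't know exactly how DP+ implements the preview-only fetch, but my guess is that a partially downloaded .xlsx (a ZIP container) is no longer a valid spreadsheet, which would explain the qsv excel failure above. Here is a minimal sketch (not DP+ code) to test that theory outside of DP+; the URL, byte limit, and output path are placeholders/assumptions, and the qsvdp invocation just mirrors the command shape from the traceback:

import subprocess
import tempfile

import requests

RESOURCE_URL = "http://<ckan-host>/dataset/<dataset-id>/resource/<resource-id>/download/TCEQ-TEST.xlsx"  # placeholder
PREVIEW_BYTES = 100_000  # roughly the 0.09MB the preview fetch pulled down in the CSV case below

# Download only the first PREVIEW_BYTES of the resource, the way a preview-only fetch might.
with tempfile.NamedTemporaryFile(suffix=".XLSX", delete=False) as tmp:
    with requests.get(RESOURCE_URL, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        fetched = 0
        for chunk in resp.iter_content(chunk_size=16_384):
            tmp.write(chunk)
            fetched += len(chunk)
            if fetched >= PREVIEW_BYTES:
                break  # stop early; the rest of the file is never downloaded
    partial_xlsx = tmp.name

# Same command shape as the failing step in jobs.py (see traceback above).
result = subprocess.run(
    ["/usr/local/bin/qsvdp", "excel", partial_xlsx, "--sheet", "0", "--trim",
     "--output", partial_xlsx + ".csv"],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stderr)  # a truncated .xlsx should be rejected here with a similar error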

When I try with the same XLSX file converted to a CSV, I get the following error:

datapusher-plus  | 2023-06-26 17:02:57,076 INFO Fetching from: http://192.168.7.200:5000/dataset/8cbdffdb-1cef-4c9d-84fd-005fde129962/resource/9af29c46-4f37-4c8f-9021-09bf7af88f9b/download/tceq-test.csv...
datapusher-plus  | 127.0.0.1 - - [26/Jun/2023:17:02:57 +0000] "GET /job/3b1c2e8d-29de-4d65-87b8-e3d800129cfe HTTP/1.1" 200 1111 "-" "python-requests/2.25.1"
datapusher-plus  | 2023-06-26 17:02:57,161 INFO Downloading only first 1,000 row preview from 5.31MB file...
datapusher-plus  | 2023-06-26 17:02:57,170 INFO Fetched 0.09MB file in 0.09 seconds.
datapusher-plus  | 2023-06-26 17:02:57,177 INFO ANALYZING WITH QSV..
datapusher-plus  | 2023-06-26 17:02:57,184 INFO Normalizing/UTF-8 transcoding CSV...
datapusher-plus  | Invalid CSV. Last valid row (4): CSV error: record 4 (line: 5, byte: 446): found record with 23 fields, but the previous record has 3 fields
datapusher-plus  | 2023-06-26 17:02:57,237 ERROR Job aborted as the file cannot be normalized/transcoded: Command '['/usr/local/bin/qsvdp', 'input', '/tmp/tmpso60e8jy..csv', '--trim-headers', '--output', '/tmp/tmp3zaav2od.csv']' returned non-zero exit status 1..
datapusher-plus  | --- Logging error ---
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 692, in push_to_datastore
datapusher-plus  |     subprocess.run(
datapusher-plus  |   File "/usr/lib/python3.10/subprocess.py", line 524, in run
datapusher-plus  |     raise CalledProcessError(retcode, process.args,
datapusher-plus  | subprocess.CalledProcessError: Command '['/usr/local/bin/qsvdp', 'input', '/tmp/tmpso60e8jy..csv', '--trim-headers', '--output', '/tmp/tmp3zaav2od.csv']' returned non-zero exit status 1.
datapusher-plus  |
datapusher-plus  | During handling of the above exception, another exception occurred:
datapusher-plus  |
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/usr/lib/python3.10/logging/handlers.py", line 1057, in emit
datapusher-plus  |     smtp = smtplib.SMTP(self.mailhost, port, timeout=self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 255, in __init__
datapusher-plus  |     (code, msg) = self.connect(host, port)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 341, in connect
datapusher-plus  |     self.sock = self._get_socket(host, port, self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 312, in _get_socket
datapusher-plus  |     return socket.create_connection((host, port), timeout,
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 845, in create_connection
datapusher-plus  |     raise err
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 833, in create_connection
datapusher-plus  |     sock.connect(sa)
datapusher-plus  | ConnectionRefusedError: [Errno 111] Connection refused
datapusher-plus  | Call stack:
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
datapusher-plus  |     self._bootstrap_inner()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
datapusher-plus  |     self.run()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 953, in run
datapusher-plus  |     self._target(*self._args, **self._kwargs)
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
datapusher-plus  |     work_item.run()
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
datapusher-plus  |     result = self.fn(*self.args, **self.kwargs)
datapusher-plus  |   File "/usr/lib/ckan/dpplus_venv/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
datapusher-plus  |     retval = job.func(*job.args, **job.kwargs)
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 706, in push_to_datastore
datapusher-plus  |     logger.error(
datapusher-plus  | Message: "Job aborted as the file cannot be normalized/transcoded: Command '['/usr/local/bin/qsvdp', 'input', '/tmp/tmpso60e8jy..csv', '--trim-headers', '--output', '/tmp/tmp3zaav2od.csv']' returned non-zero exit status 1.."
datapusher-plus  | Arguments: ()
datapusher-plus  | 127.0.0.1 - - [26/Jun/2023:17:03:01 +0000] "GET /job/3b1c2e8d-29de-4d65-87b8-e3d800129cfe HTTP/1.1" 200 2217 "-" "python-requests/2.25.1"
jqnatividad (Contributor) commented

Thanks @twdbben for the report.

DOWNLOAD_PREVIEW_ONLY was created specifically for the Data Hub, so you don't need to download huge files just to get the first PREVIEW_ROWS rows.

Apparently, I'll have to make partial download analysis more robust.

For now, just leave DOWNLOAD_PREVIEW_ONLY=False.
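
For anyone else hitting this, the workaround amounts to keeping the config from the report and flipping only the one flag (variable names as used above):

PREVIEW_ROWS=1000
ADD_SUMMARY_STATS_RESOURCE=True
SUMMARY_STATS_WITH_PREVIEW=True
DOWNLOAD_PREVIEW_ONLY=False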

@jqnatividad jqnatividad transferred this issue from dathere/datapusher-plus-docker Jun 26, 2023
twdbben (Author) commented Jun 26, 2023

@jqnatividad Another problem we were seeing with DOWNLOAD_PREVIEW_ONLY=True was that the maximum row count was always set to whatever PREVIEW_ROWS is, instead of the actual record count.
