tony-core runtime error #675

Open
tonywang-sh opened this issue Jul 26, 2022 · 14 comments

@tonywang-sh

There are error messages about tony.TonyClient when running a TonY task on YARN with Hadoop 3.2.2. The error messages are below. How should I deal with these errors?

2022-07-26 06:35:41,245 WARN ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
at org.apache.hadoop.ipc.Client.call(Client.java:1452)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy17.getTaskInfos(Unknown Source)
at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskInfos(TensorFlowClusterPBClientImpl.java:77)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy18.getTaskInfos(Unknown Source)
at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskInfos(ApplicationRpcClient.java:82)
at com.linkedin.tony.TonyClient.updateTaskInfoAndReturn(TonyClient.java:1192)
at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:1046)
at com.linkedin.tony.TonyClient.run(TonyClient.java:225)
at com.linkedin.tony.TonyClient.start(TonyClient.java:1293)
at java.lang.Thread.run(Thread.java:748)

@zuston
Member

zuston commented Jul 29, 2022

If you submit a TonY app to a secured cluster, the submitting machine must be authenticated, which means a keytab or principal must be provided.

I think you could use this machine to submit a Spark app as a test. If that works, the TonY app can also be submitted to the cluster.
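
A minimal sketch of the pre-submit check described above, assuming the MIT Kerberos client tools (`klist`, `kinit`) are installed on the submitting machine. The function name, keytab path, and principal are hypothetical placeholders, not values from this thread.

```shell
# Sketch: make sure a valid Kerberos ticket exists before submitting a job.
# The keytab path and principal passed in are placeholders (assumptions).
ensure_kerberos_ticket() {
  local keytab="$1"
  local principal="$2"

  if ! command -v klist >/dev/null 2>&1; then
    echo "klist not found; Kerberos client tools are not installed" >&2
    return 2
  fi

  # klist -s prints nothing and exits 0 only if a valid ticket is cached.
  if klist -s; then
    echo "valid Kerberos ticket found"
    return 0
  fi

  if [ ! -r "$keytab" ]; then
    echo "keytab not readable: $keytab" >&2
    return 1
  fi

  # Obtain a ticket non-interactively from the keytab.
  kinit -kt "$keytab" "$principal"
}
```

It could be called as `ensure_kerberos_ticket /path/to/user.keytab user@EXAMPLE.COM` before running the submit command; a non-zero exit status means no usable ticket was obtained.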

@tonywang-sh
Author

Thanks for your reply. The cluster runs Hadoop 3.2.2 with Kerberos, and I ran the Spark example successfully. I tried the mnist-tensorflow example following the guide at https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow, but it failed. Do I need any other settings or configuration for this task?

@zuston
Member

zuston commented Jul 29, 2022

Please attach the detailed error log, the CLI submit command and its arguments, tony.xml, and so on.

@tonywang-sh
Author

tonywang-sh commented Aug 3, 2022

CLI command:
#!/usr/bin/env bash
java -cp `hadoop classpath`:/data/tony-dist/tony-cli-0.5.3-uber.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/data/venv/myvenv.zip \
--src_dir=/data/tony-dist/mnist-tensorflow \
--executes=mnist_distributed.py \ # relative path inside src/
--task_params="--steps 1000 --data_dir /user/test/tony/data --working_dir /user/test/tony/model" \ # You can use your HDFS path here.
--conf_file=/data/tony-dist/tony.xml \
--python_binary_path=venv/bin/python # relative path inside venv.zip
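
One thing worth checking in the command as posted: a comment placed after a continuation backslash (`\ # ...`) makes the backslash escape the following space instead of the newline, so the shell ends the command on that line and runs the remaining options as separate commands. The sketch below is one way to restructure it, assuming the script was run as posted. The function name `submit_tony_job` and the `DRY_RUN` switch are illustrative additions; the paths and options are copied from the command above.

```shell
# Sketch: same options as the posted command, with comments moved out of the
# continued lines so every backslash is the last character on its line.
submit_tony_job() {
  # --executes is relative to --src_dir.
  # --python_binary_path is relative to the root of the venv zip.
  # --task_params may point at your own HDFS paths.
  set -- \
    --python_venv=/data/venv/myvenv.zip \
    --src_dir=/data/tony-dist/mnist-tensorflow \
    --executes=mnist_distributed.py \
    --task_params="--steps 1000 --data_dir /user/test/tony/data --working_dir /user/test/tony/model" \
    --conf_file=/data/tony-dist/tony.xml \
    --python_binary_path=venv/bin/python

  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Print one argument per line instead of submitting.
    printf '%s\n' "$@"
    return 0
  fi

  java -cp "$(hadoop classpath):/data/tony-dist/tony-cli-0.5.3-uber.jar" \
    com.linkedin.tony.cli.ClusterSubmitter "$@"
}
```

Running `DRY_RUN=1 submit_tony_job` shows exactly which arguments would reach `ClusterSubmitter`, which makes a silently dropped option easy to spot.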

tony.xml:
[image]

Error logs are below:
AM Container for appattempt_1657011602166_1367_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2022-08-03 13:41:09.319]Exception from container-launch.
Container id: container_e94_1657011602166_1367_02_000001
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is test
main : requested yarn user is test
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data1/yarn/nm/nmPrivate/application_1657011602166_1367/container_e94_1657011602166_1367_02_000001/container_e94_1657011602166_1367_02_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
[2022-08-03 13:41:09.321]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of amstderr.log :
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataOutputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more

@zuston
Member

zuston commented Aug 8, 2022

Is this the same problem as #672?

It looks like the NodeManager machine doesn't have the complete Hadoop environment.
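
A rough way to test this on a NodeManager host is to check whether the Hadoop classpath there contains a hadoop-common jar, which is where `org.apache.hadoop.fs.FSDataOutputStream` lives. This is a sketch only: the helper name `classpath_has_jar` is made up for illustration, and it assumes the `hadoop` command is on the PATH of the user that runs containers.

```shell
# Sketch: return 0 if a colon-separated classpath contains a jar whose file
# name starts with "<prefix>-<version>", for example hadoop-common-3.2.2.jar.
classpath_has_jar() {
  local cp_string="$1"
  local prefix="$2"
  printf '%s\n' "$cp_string" | tr ':' '\n' | grep -q "/${prefix}-[0-9][^/]*\.jar\$"
}

if command -v hadoop >/dev/null 2>&1; then
  # --glob expands wildcard entries so individual jar files can be matched.
  if classpath_has_jar "$(hadoop classpath --glob)" hadoop-common; then
    echo "hadoop-common jar found on the classpath"
  else
    echo "hadoop-common jar missing from the classpath"
  fi
else
  echo "hadoop is not on PATH on this host"
fi
```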

@tonywang-sh
Author

tonywang-sh commented Aug 8, 2022

Got it. I have updated the Hadoop environment, and now it reports a Python error, shown below.
[image]

The error: ModuleNotFoundError: No module named 'contextlib'

@zuston
Member

zuston commented Aug 8, 2022

You should package your pyenv zip on a Linux machine running the same system as the NM (NodeManager). @tonywang-sh
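
To compare the packaging machine with the NodeManager, a small fingerprint of each host can be collected and diffed. This is a sketch under assumptions: the function name `env_fingerprint` is illustrative, and the libc line relies on `getconf GNU_LIBC_VERSION`, which is only meaningful on glibc-based systems.

```shell
# Sketch: print a few facts that should match between the machine that
# builds the Python environment zip and the NodeManager that unpacks it.
env_fingerprint() {
  echo "arch=$(uname -m)"
  echo "kernel=$(uname -r)"
  if [ -r /etc/os-release ]; then
    # PRETTY_NAME holds the distribution name and version.
    grep '^PRETTY_NAME=' /etc/os-release
  fi
  if command -v getconf >/dev/null 2>&1; then
    echo "libc=$(getconf GNU_LIBC_VERSION 2>/dev/null || echo unknown)"
  fi
}
```

Running `env_fingerprint > local.txt` on the packaging machine and `env_fingerprint > nm.txt` on a NodeManager, then `diff local.txt nm.txt`, highlights any mismatch in architecture, distribution, or libc version.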

@tonywang-sh
Author

tonywang-sh commented Aug 9, 2022

My pyenv was packaged on an Ubuntu 18.04 system with Anaconda, following the guide at https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow. Do you have another guide on setting up an NM-compatible system environment for packaging this pyenv zip? Thanks.

@zuston
Member

zuston commented Aug 9, 2022

Conda is also OK. If you want to check whether the env is OK, you could launch it on your local machine.
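
A minimal sketch of such a check: unpack the zip into a scratch directory, as a container would, and ask the bundled interpreter to import the module that failed. The function name `check_venv_zip` is hypothetical, and the sketch assumes `unzip` and `mktemp` are available. Running it on a worker node rather than the packaging machine is the more telling test.

```shell
# Sketch: unpack a Python environment zip into a temporary directory and
# verify that its interpreter starts and can import contextlib.
check_venv_zip() {
  local zip_path="$1"      # e.g. /data/venv/myvenv.zip
  local python_rel="$2"    # interpreter path relative to the zip root
  local work_dir
  local status

  if [ ! -r "$zip_path" ]; then
    echo "cannot read $zip_path" >&2
    return 1
  fi

  work_dir="$(mktemp -d)" || return 1
  if ! unzip -q "$zip_path" -d "$work_dir"; then
    rm -rf "$work_dir"
    return 1
  fi

  # contextlib is part of the standard library, so a failure here points at
  # the interpreter not finding its stdlib rather than at a missing package.
  "$work_dir/$python_rel" -c 'import contextlib, sys; print(sys.prefix)'
  status=$?

  rm -rf "$work_dir"
  return "$status"
}
```

For the paths used in this thread the call would be `check_venv_zip /data/venv/myvenv.zip venv/bin/python`.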

@tonywang-sh
Author

I used Anaconda to package a virtualenv Python and obtained a pyenv zip, but this zip does not work on the worker nodes. Is this the right method?

@zuston
Member

zuston commented Aug 9, 2022

Can this pyenv be used on your local machine? You'd better pre-check it.

@tonywang-sh
Author

It worked on the local machine when run with the "venv/bin/python" command line, but failed on the remote worker node when the task was submitted with the TonY script.

@zuston
Member

zuston commented Aug 10, 2022

I guess this is because your local machine's environment is not consistent with the NodeManager's.

@tonywang-sh
Author

If the pyenv is packaged with virtualenv or Anaconda, does this Python environment need to be activated on the worker node, for example with the command 'venv/bin/activate', before the task starts on the worker? I didn't find this "activate" operation in the TonY project.
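
As general background, not confirmed as the cause in this thread: an `activate` script only adjusts shell variables such as `PATH` and `VIRTUAL_ENV`, so calling `venv/bin/python` directly does not require it. What does matter is that an environment created by virtualenv or `python -m venv` usually keeps `bin/python` as a symlink to a base interpreter and records that location in `pyvenv.cfg`; if the base installation is absent on the worker, standard-library imports can fail. The sketch below inspects an unpacked environment for this; the function name `describe_venv_python` is hypothetical.

```shell
# Sketch: report whether the interpreter inside an unpacked environment is a
# symlink, and which base installation pyvenv.cfg refers to (if present).
describe_venv_python() {
  local venv_dir="$1"
  local py="$venv_dir/bin/python"

  if [ -L "$py" ]; then
    echo "symlink -> $(readlink "$py")"
  elif [ -x "$py" ]; then
    echo "regular executable"
  else
    echo "missing"
  fi

  if [ -f "$venv_dir/pyvenv.cfg" ]; then
    # The "home" entry names the directory of the base interpreter.
    grep '^home' "$venv_dir/pyvenv.cfg"
  fi
}
```

If the output shows a symlink or a `home` entry pointing to a path that does not exist on the worker nodes, the environment depends on files outside the zip.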
