TensorBoard callback without profile_batch setting causes errors CUPTI_ERROR_INSUFFICIENT_PRIVILEGES and CUPTI_ERROR_INVALID_PARAMETER #35860
Comments
Facing the same problem.
Same issue, different code sample. OS Platform and Distribution: Ubuntu 18.04
Same (INSUFFICIENT PRIVILEGES):
OS Platform and Distribution: Ubuntu 18.04
CUDA/cuDNN version: 10.1
Host: ttmagpie_d99d3f105d0a
@gawain-git-code ,
@airMeng @kevin-hartman @eduardofv Are you guys satisfied with the answer from @oanush? I personally am not, because it was just telling me I can only use the profile_batch setting safely in Google Colab instead of on my own machine setup. It does not answer the root cause of why cupti_tracer is signalling the errors. But thank you @oanush for spending the time to help out.
I am still receiving the error but have moved to another environment. It may have to do with driver updates on Ubuntu. Will try to check again and get back to you.
The error states that
Maybe it's an error with the configuration itself.
Facing the same issue when capturing profile data with TensorBoard through gRPC. I tried the solution suggested by NVIDIA (enable non-privileged access to the profiling counters), to no avail. My training runs as root inside a container (which was the other solution suggested by NVIDIA).
I have the same issue when I use the official TensorFlow docker image (tensorflow/tensorflow:2.1.0-gpu-py3):
E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1329] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
So I went back to the tensorflow/tensorflow:2.0.0-gpu-py3 image.
@dartlune Did going back to the tensorflow/tensorflow:2.0.0-gpu-py3 image help? I cannot save the model in either version of the docker image! The weird thing is that when running the model in a Jupyter notebook, it saves the model each iteration, but not with python3! Any suggestions?
@tamaramiteva Actually I had some errors with tensorflow/tensorflow:2.0.0-gpu-py3. LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
Any update on this problem?
This is due to an NVIDIA CUPTI library API change: https://developer.nvidia.com/nvidia-development-tools-solutions-err-nvgpuctrperm-cupti
Also note that the TensorBoard profiler plugin was broken by the Chrome 80 update, see tensorflow/tensorboard#3209. The suggested workaround works: run Chrome with the --enable-blink-features=ShadowDOMV0,CustomElementsV0,HTMLImports flags, like:
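(The exact command was not captured above; presumably it looks something like the following, where the google-chrome binary name is an assumption that depends on your platform and install:)

```bash
# Launch Chrome with the legacy Shadow DOM / HTML Imports features re-enabled
google-chrome --enable-blink-features=ShadowDOMV0,CustomElementsV0,HTMLImports
```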
Adding
I have tested the above-mentioned solutions. It seems there is no quick way out of the dilemma of the error "CUPTI_ERROR_INSUFFICIENT_PRIVILEGES".
1. The ad-hoc solution: Even though NVIDIA gives the temporary "CAP_SYS_ADMIN" workaround, it is an ad-hoc solution. It sometimes works and sometimes does not.
$ python abc.py --cap-add=CAP_SYS_ADMIN
2. LD_LIBRARY_PATH is not reliable: The following settings sometimes work and sometimes do not.
LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
or
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/extras/CUPTI/lib64
3. No "/etc/modprobe.d/nvidia-kernel-common.conf": My modprobe.d directory does not include the path "/etc/modprobe.d/nvidia-kernel-common.conf", so I could not add "NVreg_RestrictProfilingToAdminUsers=0" to it.
NVIDIA gives a quite valuable explanation of the error, so it is quite strange.
Thanks, it worked for me.
None of the solutions offered here or anywhere else has worked for me. Perhaps it would work if one upgrades from Ubuntu 16.04 to Ubuntu 18.04, but since I'm on a shared server, it may take some time to do the upgrade. I have not tried docker yet. OS Platform and Distribution: Ubuntu 16.04
I am having the same error in an Anaconda environment. None of the solutions posted above work for me. Does anyone have any idea what can be done?
TensorFlow uses the NVIDIA-provided libcupti for GPU tracing support. However, since CUDA 10 that functionality requires the CAP_SYS_ADMIN privilege, or you have to change /etc/modprobe.d/nvidia-kernel-common.conf (which also requires sudo, but only once). I believe NVIDIA enforces this restriction because some research papers showed you can steal user secrets by probing performance counters.
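For reference, a sketch of that modprobe.d change on Ubuntu/Debian-style systems, following NVIDIA's ERR_NVGPUCTRPERM guidance linked earlier in this thread (the exact file name and initramfs command vary by distribution, so treat these as assumptions):

```bash
# Allow non-admin users to read GPU performance counters (needs sudo once)
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | \
    sudo tee /etc/modprobe.d/nvidia-kernel-common.conf
# Rebuild the initramfs so the module option is picked up, then reboot
sudo update-initramfs -u
sudo reboot
```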
Hey @trisolaran, thanks for the brief intro. The thing is I do not have such a file as /etc/modprobe.d/nvidia-kernel-common.conf. I am using a conda environment.
@SarfarazHabib Hi, I am using a conda environment too and I solved this problem by adding
@kunihik0 Thanks a lot for the help. The error is now gone, but my training gets stuck after a random number of epochs. I am using TensorFlow 2.3 for now on Ubuntu 18.04. Can anyone point me in any direction with respect to this new problem?
This solution works great for Windows 10 systems. But what about Windows Server 2019? It seems that now Microsoft requires you to get NVIDIA Control Panel from the Microsoft Store, but that is not available on Windows Server 2019. Is there an alternative way to allow these permissions on Windows Server 2019?
This works for me after switching from conda to virtualenv, and I also need to use
OS Platform and Distribution: Ubuntu 16.04
Now I can profile with --profile_steps=1000,1005, for example (5 steps), but if I increase it to 10, a non-deterministic segfault appears. Not sure whether this has happened to anyone else?
Yes, I get that segfault too – I think it's because the overhead of profiling, on top of regular GPU computations, causes GPU memory overflow.
In order to run docker:
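(The original command was not captured here; a plausible sketch, assuming the NVIDIA Container Toolkit is installed and using the TF image mentioned above; train.py is a placeholder script name:)

```bash
# Run the TF GPU image with the SYS_ADMIN capability so CUPTI can access
# GPU performance counters inside the container
docker run --rm -it \
  --gpus all \
  --cap-add=SYS_ADMIN \
  -v "$PWD":/workspace -w /workspace \
  tensorflow/tensorflow:2.1.0-gpu-py3 \
  python train.py
```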
@vlasenkoalexey Do you mean that the NVIDIA CUPTI library version with the API change results in the error? Will an older version, whose API did not change, run normally?
The CUPTI library is part of CUDA; before CUDA 10.x, profiling didn't require admin privileges. See the NVIDIA doc for details: https://developer.nvidia.com/nvidia-development-tools-solutions-err-nvgpuctrperm-cupti
@vlasenkoalexey But CUDA 10.x also has the privileges problem on my local machines, as do many people in this issue. My local configuration: Ubuntu 18.04, Python 3.7, CUDA 10.1 / CUDA 10.2 (two machines)
Was able to reproduce the issue in TF v2.5, please find the gist here. Thanks!
I also met this problem. My OS is CentOS 7; adding the conf file under /etc/modprobe.d/ and then rebuilding the initial RAM disk by
Hi @gawain-git-code, could you look at this thread for an answer?
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
How about CPU training?
I'm running a TensorFlow application in a Docker container on a Windows machine with WSL2. I get the following errors:
I changed the /etc/modprobe.d/nvidia-kernel-common.conf file as suggested and run the docker container as the root user.
System information
Describe the current behavior
When using tf.keras.callbacks.TensorBoard() without the profile_batch setting, it emits CUPTI_ERROR_INSUFFICIENT_PRIVILEGES and CUPTI_ERROR_INVALID_PARAMETER errors from tensorflow/core/profiler/internal/gpu/cupti_tracer.cc.
Describe the expected behavior
With profile_batch = 0, these two errors are gone.
But they come back when profile_batch = 1 or other non-zero values.
Code to reproduce the issue
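(The original snippet was not captured here. Below is a minimal sketch that matches the log further down: 800 training samples, 200 validation samples, 5 epochs, default profile_batch. The model, data, and log directory are assumptions, not the reporter's exact code.)

```python
import numpy as np
import tensorflow as tf

# Assumed synthetic data: 1000 samples split 800 train / 200 validation
x = np.random.rand(1000, 10).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Leaving profile_batch at its default triggers the CUPTI errors on affected
# setups; passing profile_batch=0 disables profiling and silences them.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="./logs")

model.fit(x, y, epochs=5, validation_split=0.2, callbacks=[tb_cb])
```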
Other info / logs
Train on 800 samples, validate on 200 samples
2020-01-14 21:30:27.591905: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-01-14 21:30:27.594743: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 1 GPUs
2020-01-14 21:30:27.599172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_101.dll
2020-01-14 21:30:27.704083: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2020-01-14 21:30:27.716790: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1346] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
Epoch 1/5
2020-01-14 21:30:28.370429: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-01-14 21:30:28.651767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-01-14 21:30:29.662864: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1329] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
2020-01-14 21:30:29.670282: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88] GpuTracer has collected 0 callback api events and 0 activity events.
800/800 [==============================] - 5s 6ms/sample - loss: 0.0011 - val_loss: 0.0011
Epoch 2/5
800/800 [==============================] - 3s 4ms/sample - loss: 8.5921e-04 - val_loss: 0.0010
Epoch 3/5
800/800 [==============================] - 3s 3ms/sample - loss: 8.5613e-04 - val_loss: 0.0010
Epoch 4/5
800/800 [==============================] - 3s 4ms/sample - loss: 8.5458e-04 - val_loss: 9.9713e-04
Epoch 5/5
800/800 [==============================] - 3s 4ms/sample - loss: 8.5345e-04 - val_loss: 9.8825e-04