Getting Stuck with TensorFlow + GPU - 飽きるまでやります。

Overview

I tried to use a GPU with TensorFlow and got stuck.

Environment

CUDA 8.0
I have confirmed with nvidia-smi that the GTX 1080 Ti is recognized.

Thu May 10 14:17:40 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:05:00.0      On |                  N/A |
| 29%   51C    P0    63W / 250W |    143MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1292    G   /usr/bin/X                                      99MiB |
|    0      1980    G   compiz                                          41MiB |
+-----------------------------------------------------------------------------+

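What happened

TF=1.4 TF-gpu=1.4
The TensorFlow softmax function called in my program fails with an unexpected-argument error.
It asks for libcudnn.so.6, i.e. cuDNN 6 (the 6 is a cuDNN version, not CUDA 6.0).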

ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
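An ImportError like this only says which shared library the loader could not open. A minimal sketch, assuming a Linux machine and using only the standard-library ctypes module, that probes a few library names directly. libcudnn.so.6 and libcublas.so.9.0 come from the errors in this post; the other two names are guesses at what older builds might look for.

```python
import ctypes

# Names from the error messages in this post, plus two guessed older ones.
CANDIDATES = [
    "libcudnn.so.5",     # guess: what an older TensorFlow build might want
    "libcudnn.so.6",     # from the TF-gpu 1.4 ImportError
    "libcublas.so.8.0",  # guess: the cuBLAS name shipped with CUDA 8.0
    "libcublas.so.9.0",  # from the TF-gpu 1.6 ImportError
]


def can_load(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


def probe(names=CANDIDATES):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    for name, ok in probe().items():
        print(name, "found" if ok else "missing")
```

The result depends on the loader search path (for example LD_LIBRARY_PATH), so "missing" can also mean "installed but not on the path".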

TF=1.5 TF-gpu=1.5
The program runs, but TensorFlow does not recognize the GPU.

>>> from tensorflow.python.client import device_lib
  from ._conv import register_converters as _register_converters

>>> device_lib.list_local_devices()
2018-05-10 14:12:24.458617: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8095756809744274102
]
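Reading the device list by eye is fine for one run, but a small filter makes the check repeatable across versions. A minimal sketch: gpu_device_names and FakeDevice are hypothetical names introduced here, and the only assumption about TensorFlow is that each entry returned by device_lib.list_local_devices() has name and device_type fields, as the output above shows.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by list_local_devices().
FakeDevice = namedtuple("FakeDevice", ["name", "device_type"])


def gpu_device_names(devices):
    """Return the names of all entries whose device_type is "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    try:
        from tensorflow.python.client import device_lib
        devices = device_lib.list_local_devices()
    except ImportError:
        # TensorFlow (or one of its CUDA libraries) could not be imported;
        # fall back to a CPU-only placeholder so the script still runs.
        devices = [FakeDevice("/device:CPU:0", "CPU")]
    names = gpu_device_names(devices)
    print(names if names else "no GPU visible to TensorFlow")
```

An empty list corresponds to the CPU-only output shown for TF-gpu=1.5.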

TF-gpu=1.6
It asks for CUDA 9.0.

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

TF-gpu=1.0

>>> device_lib.list_local_devices()
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:05:00.0
Total memory: 10.91GiB
Free memory: 10.54GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6957358555731752284
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 10754218394
locality {
  bus_id: 1
}
incarnation: 7704713809916469342
physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0"
]

It recognized the GPU. Nice hardware in this machine.

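Summary and guesses

The CUDA and TensorFlow versions have to match strictly.
Several versions of the CUDA-related libraries seem to be installed. The version 6 that was missing is cuDNN (libcudnn.so.6), not CUDA. Older versions (1.0 or 5.0, as far as I could tell) seem to be present, and that apparently matched what TensorFlow 1.0 expects, which would explain why it worked.
If TensorFlow is too old, the code raises errors, probably because of API changes.
With CUDA 8.0 and TF-gpu 1.5 there is no error but the GPU is not recognized, so something must be incompatible? When I tried again the next day, TF-gpu 1.5 asked for CUDA 9.0. A mystery.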
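The version matching above can be written down as a small lookup table. This is a sketch, not an authoritative list: the CUDA/cuDNN pairings are recalled from TensorFlow's published tested build configurations and should be checked against that table before relying on them. REQUIREMENTS, required_libs and matches_cuda are hypothetical names used only for illustration.

```python
# Assumed pairings (verify against TensorFlow's tested build configurations).
REQUIREMENTS = {
    "1.0": {"cuda": "8.0", "cudnn": "5.1"},
    "1.4": {"cuda": "8.0", "cudnn": "6"},
    "1.5": {"cuda": "9.0", "cudnn": "7"},
    "1.6": {"cuda": "9.0", "cudnn": "7"},
}


def required_libs(tf_version):
    """Look up the assumed CUDA/cuDNN pair for a version such as "1.4.0"."""
    key = ".".join(tf_version.split(".")[:2])  # keep only major.minor
    return REQUIREMENTS.get(key)


def matches_cuda(tf_version, installed_cuda):
    """Return True if the installed CUDA version equals the assumed requirement."""
    req = required_libs(tf_version)
    return req is not None and req["cuda"] == installed_cuda


if __name__ == "__main__":
    for version in ("1.0.0", "1.4.0", "1.5.0", "1.6.0"):
        print(version, required_libs(version), matches_cuda(version, "8.0"))
```

If the table is right, CUDA 8.0 alone is not enough for TF-gpu 1.4 (it also needs cuDNN 6), and TF-gpu 1.5 and 1.6 both need CUDA 9.0, which is consistent with the errors shown in this post.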

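Since TF-gpu 1.0 recognized the GPU properly, my impression is that this comes down to version compatibility. I would like to try CUDA 9.0 with TF-gpu 1.5, but other people use this PC and I am currently out of the prefecture, so I will try again another day. I did start upgrading to 9.0 on my own, but it looked like a lot of work, so it seems I cannot just do as I please.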

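Honestly, I would rather not touch anything around the graphics driver, because a failure there leaves you without a display and is a real pain to recover from...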

Not that it matters, but I want a GTX 1080 Ti of my own!

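Addendum

The machine runs Ubuntu 14.04, and CUDA 9.0 apparently does not support 14.04, so I expect this to be a real hassle.
As a stopgap for now, it works with

Keras 2.0.0
TensorFlow-gpu 1.0.0

so I am noting that here.
Lowering the TensorFlow-gpu version caused an error in the softmax function, so I tried lowering the Keras version as well, and then it worked.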
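Since the working combination is a matter of exact versions, it may help to print what is actually installed before and after changing anything. A minimal sketch, assuming Python 3.8 or later (for importlib.metadata); installed_version and report are hypothetical helper names, and the package names are the pip distribution names used in this post.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("tensorflow", "tensorflow-gpu", "keras")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(name, version if version is not None else "not installed")
```

On the stopgap setup described above, this would be expected to show 1.0.0 for tensorflow-gpu and 2.0.0 for keras.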