[PyCUDA] Context being sporadically destroyed when using multiple threads and contexts

Noah Young
I'm trying to run jobs on several GPUs at the same time using multiple threads, each with its own context. Sometimes this works flawlessly, but ~75% of the time I get a cuModuleLoadDataEx error telling me the context has been destroyed. What's frustrating is that nothing changes between failed and successful runs of the code. From what I can tell it's down to luck whether or not the error comes up:

~/anaconda3/lib/python3.6/site-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
    292 
    293         from pycuda.driver import module_from_buffer
--> 294         self.module = module_from_buffer(cubin)
    295 
    296         self._bind_module()

LogicError: cuModuleLoadDataEx failed: context is destroyed -

I start by creating the contexts:

from pycuda import driver as cuda
cuda.init()
contexts = []
for i in range(cuda.Device.count()):
    # make_context() creates the context and pushes it onto this thread's
    # stack, so pop it right away and keep it around for later use
    c = cuda.Device(i).make_context()
    c.pop()
    contexts.append(c)

... and then set up a function that does the work under a given context, i.e.

import numpy as np
from pycuda import gpuarray

def do_work(ctx):
    with Acquire(ctx):
        a = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        b = gpuarray.to_gpu(np.random.rand(100, 400, 400))
        for _ in range(10):
            c = (a + b) / 2
        out = c.get()
    return out

where `Acquire` is a context manager that handles pushing and popping:

class Acquire:
    def __init__(self, context):
        self.ctx = context
    def __enter__(self):
        self.ctx.push()
        return self.ctx
    def __exit__(self, exc_type, exc_value, traceback):
        # Pop even if the body raised, so push/pop stay balanced.
        self.ctx.pop()
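
As a debugging aid (just a sketch, not part of my actual code), pycuda.driver.Context.get_device() reports the device of whichever context is current on the calling thread, so a variant of Acquire can log where each thread ends up:

import threading
from pycuda import driver as cuda

class LoggingAcquire(Acquire):
    # Hypothetical debugging subclass: after pushing, log the thread id
    # and the device of the context that is now current on this thread.
    def __enter__(self):
        ctx = super().__enter__()
        dev = cuda.Context.get_device()
        print("thread %d is on %s" % (threading.get_ident(), dev.name()))
        return ctx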

Finally, I run the code in parallel using a pool of threaded workers via joblib:

from joblib import Parallel, delayed
pool = Parallel(n_jobs=len(contexts), verbose=8, prefer='threads')
with pool:
    # Pass 1
    sum(pool(delayed(do_work)(ctx) for ctx in contexts))
    # Pass 2
    sum(pool(delayed(do_work)(ctx) for ctx in contexts))

Note that I do several "passes" of work with the same thread pool (I'll need to do 50 or so in my real application). The crash, when it happens, always seems to occur somewhere in the second pass; otherwise the run completes fine. Any ideas about how to keep my contexts from getting destroyed?
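
In case a comparison is useful, here is roughly what one pass looks like when each context gets its own dedicated thread instead of going through joblib's pool (an untested sketch; do_work and contexts are as above):

import threading

def run_pass():
    # One dedicated thread per context, so each context is pushed and
    # popped only by the thread that is working with it.
    results = [None] * len(contexts)

    def worker(i, ctx):
        results[i] = do_work(ctx)

    threads = [threading.Thread(target=worker, args=(i, ctx))
               for i, ctx in enumerate(contexts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results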

System info
Ubuntu 16.04 (Amazon Deep Learning AMI)
CUDA driver version 396.44
4x V100 GPUs
Python 3.6
pycuda version 2018.1.1

Re: Context being sporadically destroyed when using multiple threads and contexts

Andreas Kloeckner
Noah Young <[hidden email]> writes:
> I'm trying to run jobs on several GPUs at the same time using multiple
> threads, each with its own context. Sometimes this works flawlessly, but
> ~75% of the time I get a cuModuleLoadDataEx error telling me the context
> has been destroyed. What's frustrating is that nothing changes between
> failed and successful runs of the code. From what I can tell it's down to
> luck whether or not the error comes up:

"Context destroyed" is akin to a segmentation fault on the CPU. You
should find evidence that your code performed an illegal access, e.g.,
using 'dmesg' in the kernel log. (If you see a message "NVRM Xid ...",
that points to the problem) My first suspicion would be a bug in your
code.
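
Something along these lines (a rough sketch; dmesg may need elevated
privileges on some systems) will surface those messages:

import subprocess

# Scan the kernel log for NVRM Xid messages, which indicate GPU-side faults.
log = subprocess.check_output(["dmesg"]).decode(errors="replace")
for line in log.splitlines():
    if "NVRM" in line and "Xid" in line:
        print(line)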

Andreas
