Speeding up shutil.copytree with multiprocessing

Posted:   |  More posts about python software

New to Python 3.2's implementation of shutil.copytree is the copy_function parameter, added in issue #1540112. This new parameter allows you to specify a function to be applied to each file in the tree, defaulting to shutil.copy2. I was thinking about a problem we have at work where our continuous integration server needs to setup a build environment with clean copies of our dependencies. To do this, we do a lot of shutil.copytree'ing, getting files from internal teams and some from projects like Boost and Xerces. It takes a long ass time to copy all of that stuff. Really long. Fortunately my work computer has 16 cores, so I thought, why not make the copytree function use more of my machine and go way faster? Sounds like a job for multiprocessing. Knowing I can use this new copy_function parameter to copytree, and knowing that multiprocessing.Pool is super easy to use, I put them together. [code lang="python"] def _copy_worker(copy_fn, src, dst): copy_fn(src, dst) class FastCopier(multiprocessing.Process): def __init__(self, procs=None, cli=False, copy_fn=copy2): """procs is the number of worker processes to use for the pool cli is True when this is being used on the command line and wants the cool progress updates. copy_fn is the function to use to carry out the actual copy.""" multiprocessing.Process.__init__(self) self.procs = procs if procs else multiprocessing.cpu_count() self.copy_fn = copy2 self.callback = self._copy_done if cli else None self._queue = multiprocessing.Queue() self._event = multiprocessing.Event() self._event.set() self._count = 0 def _copy_done(self, *args): """Called when _copy_worker completes if we're running as a command line application. Writes the current number of files copied.""" self._count += 1 sys.stdout.write("Copied %d files\r" % self._count) sys.stdout.flush() def run(self): pool = multiprocessing.Pool(processes=self.procs) try: while self._event.is_set(): try: src, dst = self._queue.get_nowait() except Empty: continue pool.apply_async(_copy_worker, (self.copy_fn, src, dst), callback=self.callback) # We get kicked out of the loop once we've exited the external # copy function, e.g., shutil.copytree. pool.close() except KeyboardInterrupt: print("Interrupted") finally: pool.join() def stop(self): self._event.clear() self._queue.close() def copy(self, src, dest): """Used as the copy_function parameter to shutil.copytree""" # Push onto the queue and let the pool figure out who does the work. self._queue.put_nowait((src, dest)) [/code] What we have here is a class that uses a multiprocessing.Queue and spreads out copy jobs using a multiprocessing.Pool. The class has a copy method which simply puts a source and destination pair into the queue, then one of the many workers will actually do the copy. The _copy_worker function at the very top is the target, which simply executes the copy2 call (or whatever copy variant you actually want to execute your copy). Putting this to use is pretty easy. Just create a FastCopier, then pass the copy method of FastCopier into shutil.copytree. As copytree works its way through your tree, it will call FastCopier.copy, which pushes into the queue, and the pool splits up the work. [code lang="python"] def fastcopytree(src, dest, procs=None, cli=False): """Copy `src` to `dest` using `procs` worker processes, defaulting to the number of processors on the machine. `cli` is True when this function is being called from a command line application. """ fc = FastCopier(procs, cli) fc.start() try: # Pass in our version of "copy", which just feeds into the pool. copytree(src, dest, copy_function=fc.copy) finally: fc.stop() fc.join() [/code] It's pretty fast. As an example, I copied my py3k checkout folder which has around 17,000 files and weighs around 1.7 GB. The baseline of using a single process does the copy in 458.958 seconds (on a crappy 7200 RPM drive). Using four processes completes the work in 120.243 seconds, and eight takes 128.336 seconds. Using the default of all cores, 16 in my case, takes 217.557 seconds, so you can see it drops off after the 4-8 range but it's still 2x faster. I haven't done much investigation since I'm pretty happy with a nearly 4x performance boost, but I'd like to do better, so maybe I'll post a followup. Why I think this is so cool: I'm sure there may be better and faster ways of solving this problem using many of the finely crafted modules out there, but this is available out of the box. This comes for free and it's available right now. Sure, this isn't the killer feature of Python 3.2, but I think it showcases the extensibility and the power of Python and the standard library. After toying with it for a while, I put the initial version of my findings here and called it copymachine. It's just a standalone script right now and has no tests (I know, I know), but I'll fiddle with it and you are more than welcome to as well. (disabled comments, sorry, spam got to be too much)

Contents © 2013 Brian Curtin