Following two tutorial and summit days, then three days of the conference, the sprints got off to a great start on Sunday evening. I'm back at home now but wanted to put together a summary of the first two days. A lot of great projects got up on stage to pitch their sprint ideas, including Brett Cannon speaking for CPython, letting people know where the sprint would be, mentioning the "dev-in-a-box" CDs, and encouraging people to come out and hack. Within 15 minutes of the end of announcements, we had 7 first-time sprinters eager to dive in and get going right away. The new developer guide was instrumental in getting everyone through the initial setup. The plan was to get a Mercurial checkout and coverage.py as a starting point, as one of the suggested sprint targets was increasing test coverage. By 6:30 on the first day, we were up to 9 people fully up and running, poring over the coverage results (which were handily pre-generated on the "dev-in-a-box" CD), and diving into code. Here's what everyone worked on:
- string tests: a patch written and checked in within the first day.
- urllib tests: failures fixed on a Mac, then work started on increasing test coverage.
- dbm: a patch made a few days before the sprint was reviewed and checked in, followed up with test coverage patches.
- test_email_codecs: ported to Python 3.
- cgitb: a new test suite started, aiming for 75% test coverage.
- tarfile: test coverage work and a few bug fixes.
- unittest: an issue fixed, with tests, and checked in pretty quickly.
- pydoc: work on the named tuples issue that was being discussed on the mailing list.
- coverage: dove into the order of imports on interpreter startup to fix the coverage results before going further with them.
Alright, alright, you guys win. Enough people emailed me to say they would use a 2.x version of watcher from my previous post, so here it is: version 0.2 now supports Python 2.
The changes are pretty simple. The "biggest" part of this happened in changeset 96f3f9e4511c, where I handle a few 2 and 3 specific parts split by ifdefs. It's a few sections of handling Unicode/strings/bytes, and then a small change for 2.x to receive the action number as an int rather than a long. I think I did all of this correctly since it works, but that's a poor definition of "correctly" and my Unicode knowledge is definitely lacking.
I haven't done a ton of testing on it, but it seems to work alright in my simple test running between 2.7 and 3.1. If you have any issues with it, feel free to submit them or email me.
In the time I've been using Python, no project has started and stopped, and started and stopped again, more than my goal of writing a file system monitor. Sure, it's a small and simple project in the grand scheme of things that could be accomplished over that time, but I like to finish what I start.
The idea originally came from my father, also a Python user, suggesting something to work on, likely to help me learn, but it'd also help him out. Years ago he wanted a multi-tabbed text editor with tail -f functionality. I think I was reading through a wxPython book at the time and figured, "sure, I can learn this and make that tool." Started it up, had the shell of a simple GUI written, then came time to get the file system updates.
I probably got distracted by something, got hooked on something else, then totally forgot about the whole thing. For whatever reason, this happened every few months. About three years ago I tried to rejuvenate the whole thing and found Tim Golden's great "How do I..." page (pretty sure Dad sent this to me before). He has an example, three of them to be precise, covering exactly what I wanted to do: watch a directory for changes using Mark Hammond's pywin32. Awesome.
I got something coded up pretty quickly and took the library in a different direction, using it at work to write a Windows service that would monitor our servers and look for crash dumps and email the team. It was super simple and paid off big time, but I kinda just whipped it together and it was poorly designed.
Fast forward to a few months ago. I was bored and looking for something fun to work on -- ah, that file system watcher I've been half-assing for years. I thought to myself, "now that I actually know wtf I'm doing, I should do that, and I'm sure my Dad would get a kick out of it."
Somewhere in the middle of all of this I was writing C# and used the System.IO.FileSystemWatcher API, which was really nice. I've always wanted the same functionality in Python and liked what they had, so it would be cool to do what they did.
A few blogs around the web claimed the Win32 ReadDirectoryChangesW API was behind the scenes of FileSystemWatcher. True or not, it made sense and I was familiar with that from the Tim Golden examples and my watcher service. I've been writing and reading a lot of C code lately so I started hacking.
After reading up on a few things, I came up with a much better C equivalent of what I had in that Windows service. It's multi-threaded, uses IO Completion Ports, and seemed to work pretty well. Pass in a directory and a callable, call the start method, then you'll get callbacks for creating files, renaming files, etc. Sweet, we're on the way.
After fiddling around with that a bit, I figured it was good enough to build on. I started writing some tests and had simple things like the following working.
>>> import watcher
>>> import os
>>> callback = lambda action, path: print(action, path)
>>> w = watcher.Watcher(os.getcwd(), callback)
>>> w.flags = watcher.FILE_NOTIFY_CHANGE_FILE_NAME
>>> w.start() # Then I opened up vim and created a file called "hurf.durf"
That was cool and all, but I want to be able to follow one specific file, or files that match a certain pattern. I also want to be able to set callbacks for specific actions. Hmm, FileSystemWatcher can do that. Maybe I'll just build out a clone and see how it works.
One of the first things I wanted to figure out was how to emulate the callback attaching and detaching, like on Changed events. I needed a container that supplies += and -= for callables, which none of the built-in ones does out of the box. Easy enough: just inherit from one and provide the in-place operators. Before you get outraged: I know that's "unpythonic", but I'm going for a clone here.
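Sketched as a minimal stand-in (hypothetical class name; not the actual watcher code), a list subclass providing the two in-place operators is enough for the C#-style attach/detach syntax:

```python
class EventHandlerList(list):
    """A container of callables supporting C#-style += and -= attachment.
    A hypothetical sketch, not the real FileSystemWatcher internals."""

    def __iadd__(self, handler):
        # fsw.Created += callback lands here and appends the handler.
        self.append(handler)
        return self

    def __isub__(self, handler):
        # fsw.Created -= callback detaches it again.
        self.remove(handler)
        return self

    def fire(self, *args):
        # Call every attached handler in order.
        for handler in list(self):
            handler(*args)
```

Since __iadd__ and __isub__ return self, the augmented assignment rebinds the attribute to the same container, so identity is preserved across attaches and detaches.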
Filling in the rest was pretty easy. There's a bunch of properties in FileSystemWatcher that map to the attributes and methods of the underlying Watcher. For example, FileSystemWatcher.NotifyFilter maps to Watcher.flags, which is an OR'ed group of NotifyFilters, which are constants exposed by watcher from Win32.
The weirdest part of the whole thing is that starting and stopping FileSystemWatcher is done by setting EnableRaisingEvents to True or False. It's not a method called stop like in the underlying Watcher (or anything else that needs to start and stop). It felt wrong perpetuating this weirdness, and again I know it's "unpythonic", but I'm going for a clone here.
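That property can be sketched like this (hypothetical names; the real code would call the underlying Watcher's start and stop instead of logging):

```python
class FileSystemWatcherSketch:
    """Sketch of a property whose setter starts/stops the watcher,
    cloning .NET's EnableRaisingEvents behavior."""

    def __init__(self):
        self._enabled = False
        self.log = []  # stands in for the real start/stop side effects

    @property
    def EnableRaisingEvents(self):
        return self._enabled

    @EnableRaisingEvents.setter
    def EnableRaisingEvents(self, value):
        value = bool(value)
        if value and not self._enabled:
            self.log.append("start")   # would call Watcher.start()
        elif not value and self._enabled:
            self.log.append("stop")    # would call Watcher.stop()
        self._enabled = value
```

Assigning the same value twice is a no-op, which matches the idea that the setter only acts on actual state transitions.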
As for translating Watcher callbacks into FileSystemWatcher callbacks that work with all of the fancy filtering, it's just a simple queue, a regex, and a big dispatch on the action. Watcher calls its callback, which puts the action and relative path into the queue. FileSystemWatcher pulls it out, sees if it matches the filter, then we figure out from the action which callback to call. If it's a rename, do a special dance; otherwise create an update object, fill in the details, then start calling back to the user.
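The routing can be sketched roughly like this (hypothetical names and action numbers, with fnmatch standing in for the regex filter, and the rename dance left out):

```python
import fnmatch
import queue

# Hypothetical action numbers, standing in for the Win32 FILE_ACTION_* values.
ACTION_ADDED, ACTION_REMOVED, ACTION_MODIFIED = 1, 2, 3

def drain(q, pattern, created, deleted, changed):
    """Pull (action, relative path) pairs off the queue, filter them,
    and route each one to the matching callback."""
    while True:
        try:
            action, path = q.get_nowait()
        except queue.Empty:
            return
        if not fnmatch.fnmatch(path, pattern):
            continue  # didn't match the filter, drop it
        if action == ACTION_ADDED:
            created(path)
        elif action == ACTION_REMOVED:
            deleted(path)
        elif action == ACTION_MODIFIED:
            changed(path)
```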
>>> from FileSystemWatcher import FileSystemWatcher, NotifyFilters
>>> import os
>>> callback = lambda event: print(event.ChangeType, event.Name)
>>> fsw = FileSystemWatcher(os.getcwd())
>>> fsw.Created += callback
>>> fsw.NotifyFilter = NotifyFilters.FileName
>>> fsw.EnableRaisingEvents = True
>>> # Opened up Explorer and right clicked to create a new file
1 New Text Document.txt
There you have it. It took 235 lines of pure Python for FileSystemWatcher and 466 lines of C for watcher for this five year project to be completed. If any future employers are reading this, I'm capable of writing more than 140 lines of code per year to complete a five year project, I swear.
The project is now on PyPI under the name watcher, complete with a few binary installers. It's 3.x only because 2.x is dead, but I'll do a backport if people are interested (email me: first name at python.org).
The project is up on bitbucket: https://bitbucket.org/briancurtin/watcher. It's not really complete but it works pretty well for most usages. I know of a bunch of bugs that I'll eventually fix, but feel free to report more or even fix some of them.
Thanks for the idea, Dad.
New to Python 3.2's implementation of shutil.copytree is the copy_function parameter, added in issue #1540112. This new parameter allows you to specify a function to be applied to each file in the tree, defaulting to shutil.copy2.
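As a quick sketch of the parameter in action (a made-up wrapper function; requires Python 3.2+), anything with a (src, dst) signature works, and copytree calls it once per file:

```python
import os
import shutil
import tempfile

# Build a tiny source tree to copy.
src = tempfile.mkdtemp()
with open(os.path.join(src, "a.txt"), "w") as f:
    f.write("hello")

copied = []

def logging_copy(s, d):
    """A made-up copy_function: records each file it sees, then defers
    to the default shutil.copy2."""
    copied.append(os.path.basename(s))
    shutil.copy2(s, d)

dst = os.path.join(tempfile.mkdtemp(), "tree")
shutil.copytree(src, dst, copy_function=logging_copy)
```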
I was thinking about a problem we have at work where our continuous integration server needs to set up a build environment with clean copies of our dependencies. To do this, we do a lot of shutil.copytree'ing, getting files from internal teams and some from projects like Boost and Xerces. It takes a long ass time to copy all of that stuff. Really long. Fortunately my work computer has 16 cores, so I thought, why not make the copytree function use more of my machine and go way faster? Sounds like a job for multiprocessing.

Knowing I can use this new copy_function parameter to copytree, and knowing that multiprocessing.Pool is super easy to use, I put them together.
import multiprocessing
import queue
import sys
import threading
from shutil import copy2

def _copy_worker(copy_fn, src, dst):
    """Runs in a worker process; carries out a single copy."""
    copy_fn(src, dst)

class FastCopier:
    def __init__(self, procs=None, cli=False, copy_fn=copy2):
        """procs is the number of worker processes to use for the pool.
        cli is True when this is being used on the command line and wants
        the cool progress updates.
        copy_fn is the function to use to carry out the actual copy."""
        self.procs = procs if procs else multiprocessing.cpu_count()
        self.copy_fn = copy_fn
        self.callback = self._copy_done if cli else None
        self._queue = multiprocessing.Queue()
        self._event = multiprocessing.Event()
        self._count = 0
        # Feed the pool from a background thread while copytree runs.
        threading.Thread(target=self._feed_pool).start()

    def _copy_done(self, *args):
        """Called when _copy_worker completes if we're running as a command
        line application. Writes the current number of files copied."""
        self._count += 1
        sys.stdout.write("Copied %d files\r" % self._count)

    def _feed_pool(self):
        pool = multiprocessing.Pool(processes=self.procs)
        while not self._event.is_set():
            try:
                src, dst = self._queue.get_nowait()
            except queue.Empty:
                continue
            pool.apply_async(_copy_worker, (self.copy_fn, src, dst),
                             callback=self.callback)
        # We get kicked out of the loop once we've exited the external
        # copy function, e.g., shutil.copytree.
        pool.close()
        pool.join()

    def stop(self):
        self._event.set()

    def copy(self, src, dest):
        """Used as the copy_function parameter to shutil.copytree"""
        # Push onto the queue and let the pool figure out who does the work.
        self._queue.put((src, dest))
What we have here is a class that uses a multiprocessing.Queue and spreads out copy jobs using a multiprocessing.Pool. The class has a copy method which simply puts a source and destination pair into the queue, then one of the many workers will actually do the copy. The _copy_worker function at the very top is the target, which simply executes the copy2 call (or whatever copy variant you actually want to execute your copy).
Putting this to use is pretty easy. Just create a FastCopier, then pass its copy method to copytree as the copy_function. As copytree works its way through your tree, it will call FastCopier.copy, which pushes into the queue, and the pool splits up the work.
def fastcopytree(src, dest, procs=None, cli=False):
    """Copy `src` to `dest` using `procs` worker processes, defaulting to the
    number of processors on the machine.

    `cli` is True when this function is being called from a command line
    application."""
    fc = FastCopier(procs, cli)
    try:
        # Pass in our version of "copy", which just feeds into the pool.
        copytree(src, dest, copy_function=fc.copy)
    finally:
        fc.stop()
It's pretty fast. As an example, I copied my py3k checkout folder which has around 17,000 files and weighs around 1.7 GB. The baseline of using a single process does the copy in 458.958 seconds (on a crappy 7200 RPM drive). Using four processes completes the work in 120.243 seconds, and eight takes 128.336 seconds. Using the default of all cores, 16 in my case, takes 217.557 seconds, so you can see it drops off after the 4-8 range but it's still 2x faster. I haven't done much investigation since I'm pretty happy with a nearly 4x performance boost, but I'd like to do better, so maybe I'll post a followup.
Why I think this is so cool:
I'm sure there may be better and faster ways of solving this problem using many of the finely crafted modules out there, but this is available out of the box. This comes for free and it's available right now. Sure, this isn't the killer feature of Python 3.2, but I think it showcases the extensibility and the power of Python and the standard library.
After toying with it for a while, I put the initial version of my findings here and called it copymachine. It's just a standalone script right now and has no tests (I know, I know), but I'll fiddle with it and you are more than welcome to as well.
(disabled comments, sorry, spam got to be too much)