Today I needed to download a bunch of files from a site. That's easy enough to do with a script and some regular expressions, and I already had a PHP script for doing just that. Turns out the site was being really slow, perhaps limiting bandwidth per connection.
When things are being slow, try parallel! Unfortunately, PHP doesn't have threads built in. I thought this rather strange--threads are really useful and PHP has every other useful thing built in. While threads are not built in, the ability to fork is. Windows people can't use it, but that's not my problem.
I've never had to work with "fork" before--I've always had pthreads--but I understood the concept. Implementation, however, proved more difficult. I soon began to wish I had real threads. Anyway, I found a class called PHP_Fork by Luca Mariano. It's an older class that works with PHP 4, but functional.
Basically, I wanted to have a dispatcher thread (the main thread) and several worker threads. This is actually a textbook example of using a semaphore. The semaphore count is set to the number of threads you want to have available. The dispatcher then goes through its list of work one item at a time, pending (acquiring) the semaphore for each. When it gets the semaphore, it starts a thread to do the work. Repeat until finished. Once the maximum number of threads is running, the semaphore pend stalls the loop until one of the threads finishes, because each worker thread posts (releases) the semaphore when it is done. When the dispatcher runs out of work, the final step is simply to wait for the semaphore count to return to the total number of threads, meaning all the threads have finished. Then you are done. An example of this can be found in my image resizing script, which launches one resizing thread for each CPU core on the system.
Now to attempt this using forks--no threads and no semaphores. First I created a pool of workers. Each worker does nothing until it gets signaled that there is a job waiting for it. When it gets a job, it marks itself as busy, does the job, then marks itself as idle again. So the dispatcher, rather than pending on a semaphore, has to loop through and query all the worker forks to see if any of them are idle. If none is found, the dispatcher sleeps (i.e. wastes some time) for a bit and tries again. When an idle fork is found, the dispatcher sends a new job to the forked process and continues on. When the dispatcher runs out of work, it again loops through all the worker forks and waits for them to signal that they are done. Not as efficient as having real threads and semaphores, but functional. Getting this to work was a job, but it works.
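To make the poll-and-sleep idea concrete, here is a minimal sketch. It is not the code from my pool class or from PHP_Fork: `runInForks` is a hypothetical name, and for brevity it forks one short-lived child per job instead of keeping a persistent pool of workers. It assumes the pcntl extension is loaded (command-line PHP on a Unix-like system). The dispatcher side is the same idea, though: if every slot is busy, sleep a bit and check again.

```php
<?php
// Hypothetical helper, for illustration only: run each job in a forked child,
// keeping at most $maxWorkers children alive at once.
// Assumes the pcntl extension (CLI PHP, not Windows).
function runInForks(array $jobs, callable $work, int $maxWorkers): array
{
    $running = [];   // pid => job key, for children that are still working
    $exitCodes = []; // job key => exit code reported by the child

    // Collect every child that has already exited, without blocking.
    $reap = function () use (&$running, &$exitCodes): void {
        while (($pid = pcntl_waitpid(-1, $status, WNOHANG)) > 0) {
            if (isset($running[$pid])) {
                $exitCodes[$running[$pid]] = pcntl_wexitstatus($status);
                unset($running[$pid]);
            }
        }
    };

    foreach ($jobs as $key => $job) {
        // All slots busy: waste a little time, then poll again.
        while (count($running) >= $maxWorkers) {
            usleep(10000);
            $reap();
        }

        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child process: do the job, report success or failure, and stop.
            exit($work($job) ? 0 : 1);
        }
        $running[$pid] = $key; // Parent: remember which job this child has.
    }

    // Nothing left to hand out: wait for the stragglers to finish.
    while (count($running) > 0) {
        usleep(10000);
        $reap();
    }

    return $exitCodes;
}
```

Calling something like `runInForks($urls, $downloadOne, 20)` would keep up to 20 children busy, where `$urls` and `$downloadOne` are placeholders for a list of jobs and a callback that returns true on success. The sleep interval is a guess; shorter means less idle time but more wasted polling.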
For an example of the code, look here. PHP_Fork is the class by Luca Mariano. It looks like it uses phpDocumentor and is therefore quite thorough in its explanation. All I did was a quick hack job, so my code isn't as nice. CopyThreadPool.php is the class for working with a copy thread queue. Simply create an instance with the number of copy threads desired and queue away. CopyExample.php shows how this works. It takes a saved HTML page and downloads all the linked JPEG images using a maximum of 20 copy threads. Despite the site I was copying from limiting the bandwidth of individual connections, I was able to hold about 10 Mbit/sec using 20 copy threads, and that was a significant improvement.
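The link-finding part is the easy bit. As a rough sketch (not the actual contents of CopyExample.php; `extractJpegLinks` is a name made up for illustration), a single regular expression can pull the JPEG links out of a saved page. It assumes the links are quoted `href` attributes ending in `.jpg` or `.jpeg`, and a regex like this will miss anything fancier.

```php
<?php
// Hypothetical helper, for illustration only: return the unique .jpg/.jpeg
// URLs found in quoted href attributes of an HTML string.
function extractJpegLinks(string $html): array
{
    // Capture the quoted href value when it ends in .jpg or .jpeg (any case).
    preg_match_all('/href\s*=\s*["\']([^"\']+\.jpe?g)["\']/i', $html, $matches);

    // Drop repeated links and reindex from zero.
    return array_values(array_unique($matches[1]));
}
```

Each URL returned would then be queued as one copy job.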
Pictured is my sleepy Vinny cat soaking up a sunbeam.