.Net – the Void

Async I/O and ThreadPool Deadlock (Part 5)

Apr 082013

In part 4 we’ve replaced Parallel.ForEach with Task. This allowed us run the processes on the ThreadPool threads, but waited only in the main thread. So the worker threads were never blocked on waits. They executed the child process, then processed the output callbacks and exited promptly. All the while, the main thread is waiting for everything to finish completely. To do this, we needed to both break the Run logic from the Wait. In addition, we had to keep the Process instances alive for the callbacks to work and we can detect the end correctly.

This worked well for us. Except for the nagging problem that we can’t use Parallel.ForEach. Or rather, if we did, even accidentally, we’d deadlock as we did before. Also, our wrapper isn’t readily reusable in other classes without explicitly separating the Run call from the Wait on separate threads as we did. Clearly this isn’t a bulletproof solution.

It might be tempting to think that the Process class is a thin wrapper around the OS process, but in fact it’s rather complex. It does a lot for us, and we shouldn’t throw it away. It’d be great if we could avoid the problem with ThreadPool altogether. Remember the reason we’re using it was because the synchronous version deadlocked as well.

What if we could improve the synchronous version?

Synchronous I/O Take 2

Recall that the deadlock happened because to read until the end of the output stream, we need to wait for the process to exit. The process, in its turn, wouldn’t exit until it has written all its output, which we’re reading. Since we’re reading only one stream at a time (either StandardOutput or StandardError,) if the buffer of the one we’re not reading gets full, we deadlock.

The solution would be to read a little from each. This would work, except for the little problem that if we read from a stream that doesn’t have data, we’d block until it gets data. This is exactly like the situation we were trying to avoid. A deadlock will happen when the other stream’s buffer is full, so the child process will block on writing to it, while we are waiting to read from the other stream that has no data yet.

Peek to the Rescue?

The Process class exposes both StandardOutput or StandardError streams, which have a Peek() function that returns the next character without removing it from the buffer. It returns -1 when we have no data to read, so we postpone reading until Peek() returns > -1.

Albeit, this won’t work. As many have pointed out, StreamReader.Peek can block! Which is ironic, considering that one would typically use it to poll the stream.

It seems we have no more hope in the synchronous world. We have no getters to query the available data as in NetworkStream.DataAvailable and length will throw an exception as it needs a seekable stream (which we haven’t). So we’re back to the Asynchronous world.

The Solution: No Solution!

I was almost sure I found an answer with direct access to the StandardOutput or StandardError streams. After all, these are just wrappers around the Win32 pipes. The async I/O in .Net is really a wrapper around the native Windows async infrastructure. So, in theory, we don’t need all the layers that Process and other class add on top of the raw interfaces. Asynchronous I/O in Windows works by passing an (typically manual-reset) event object and a callback function. All we really need is the event to get triggered. Lo and behold, we are also get a wait-handle in the IAsyncResult that BeginRead of Stream returns. So we could wait on it directly, as these are triggered by the FileSystem drivers, after issuing async reads like this:

  var outAsyncRes = process.StandardOutput.BaseStream.BeginRead(outBuffer, 0, BUFFER_LEN, null, null);
  var errAsyncRes = process.StandardError.BaseStream.BeginRead(errBuffer, 0, BUFFER_LEN, null, null);
  var events = new[] { outAsyncRes.AsyncWaitHandle, errAsyncRes.AsyncWaitHandle };
  WaitHandle.WaitAny(events, waitTimeMs);

Except, this wouldn’t work. There are two reasons why this doesn’t work, one blame goes to the OS and one to .Net.

Async Event Not Triggered by OS

The first issue is that Windows doesn’t always signal this even. You read that right. In fact, a comment in the FileStream code reads:

                // Consider uncommenting this someday soon - the EventHandle 
                // in the Overlapped struct is really useless half of the
                // time today since the OS doesn't signal it. [...]

True, the state of the event object is not changed if the operation finishes before the function returns.

.Net Callback Interop via ThreadPool

Because the event isn’t signaled in all cases, .Net needs to manually signal this event object when the Overlapped I/O callback is invoked. You’d think that would save the day. Albeit, the Overlapped I/O callback doesn’t call into managed code. The CLR handles such callbacks in a special way. In fact it knows about File I/O and .Net wrappers aren’t written in pure P/Invoke, but rather by support from the CLR as well as standard P/Invoke.

Because the system can’t invoke managed callbacks, the solution is for the CLR to do it itself. Of course this needs to be one in a responsive fashion, and without blocking all of CLR for each callback invocation What better solution than to queue a task on the ThreadPool that invokes the .Net callback? This callback will signal the event object we got in the IAsyncResult and, if set, it’ll call our delegate that we could pass to the BeginRead call.

So we’re back 180 degrees to ThreadPool and the original dilemma.

Conclusion

The task at hand seemed simple enough. Indeed, even after 5 posts, I still feel the frustration of not finding a generic solution that is both scalable and abstract. We’ve tried simple synchronous I/O, then switched to async, only to find that our waits can’t be satisfied because they are serviced by the worker threads that we are using to wait for the reads to complete. This took us to a polling strategy, that, once again, failed because the classes we are working with do not allow us to poll without blocking. The OS could have saved the day but because we’re in the belly of .Net, we have to play with its rules and that meant the OS callbacks had to be serviced on the same worker threads we are blocking. The ThreadPool is another stubbornly designed process-wide object that doesn’t allow us to gracefully and, more importantly, thread-safely, query and modify.

This means that we either need to explicitly use Tasks and the ProcessExecutor class we designed in the previous post or we need to roll our own internal thread pool to service the waits.

It is very easy to overlook a cyclic dependency such as this, especially in the wake of abstraction and separation of responsibilities. The complex nature of things was the perfect setup to overlook the subtle assumptions and expectations of each part of the code: the process spawning, the I/O readers, Parallel.ForEach and ultimately, the generic and omnipresent, ThreadPool.

The only hope for solving similar problems is to first find them. By incorrectly assuming (as I did) that any thread-safe function can be wrapped in Parallel.ForEach, and patting oneself for the marvels and simplicity of modern programming languages and for being proud of our silently-failing achievement, we only miss the opportunity to do so. By testing our code and verifying our assumptions, with skepticism and cynicism, rather than confidence and pride, do we stand a chance at finding out the sad truth, at least on the (hopefully) rare cases that we fail. Or abstraction fails, at any rate.

I can only wonder with a grin about other similar cases in the wild and how they are running slower than their brain-dead, one-at-a-time versions.

April 8, 2013
Posted by Ashod Nakashian at 6:46 am
No Responses
Code Snippet, From the Trenches, Guides, Programming
Tagged with: .Net, C++, Deadlock, Leaky Abstraction, Parallelization, Pipes, Threading, ThreadPool, Win32
Font Size:
A A A

Async I/O and ThreadPool Deadlock (Part 4)

Apr 052013

In part 3 we found out that executing Process instances in Parallel.ForEach could starve ThreadPool and cause a deadlock. In this part, we’ll attempt at solving the problem.

What we’d like to have is a function or class that executes and reads the outputs of an external process such that it could be used concurrently, in Parallel.ForEach or with Tasks without ill-effects or without the user taking special precautions to avoid these problems.

A Solution

A naïve solution is to make sure there is at least one thread available in the pool when blocking on the child process and its output. In other words, the number of threads in the pool must be larger than the MaxDegreeOfParallelism of the ForEach loop. This will only work when there are no other users of the ThreadPool, then we can control these two numbers to guarantee this inequality. Even then, we might potentially need a large number of threads in the pool. But the problem is inescapable as we can’t control the complete process at all times. To make matters worse, the API for changing the number of workers in the pool are non-atomic and will always have race conditions when changing these process-wide settings.

Broadly speaking, there are three solutions. The first is to replace Parallel.ForEach with something else. The second is to replace the Process class with our own, such that we have a more flexible design that avoid the wasting a thread to wait on the callback events. The third is to replace the ThreadPool. The last can be done by simply having a private pool for waiting on the callback events. That is, unless we can find a way to make the native Process work with Parallel.ForEach without worrying about this issue.

Of course the best solution would be for the ThreadPool to be smarter and increase the number of threads available. This is the case when we set our wait functions to wait indefinitely. But that takes way too long (in my case about 12 seconds) before the ThreadPool realizes that no progress is being made, and it still has some tasks schedules for execution. It (correctly) assumes that there might be some interdependency between the running-but-not-progressing threads and those tasks waiting for execution.

The third solution is overly complicated and I’d find very little reason to defend it. Thread pools are rather complicated beasts and unless we can reuse one, it’d be an overkill to develop one. The first two solution look promising, so let’s try them out.

Replacing Parallel.ForEach

Clearly the above workarounds aren’t bulletproof. We can easily end up in situations where we timeout, and so it’d be a Red Queen’s Race. A better solution is to avoid the root of the problem, namely, to avoid wasting ThreadPool threads to wait on the processes. This can be done if we could make the wait more efficient, by combining the waits of multiple processes together. In that case, it wouldn’t matter if we used a single ThreadPool thread, or a dedicated one.

To that end, we need to do two things. First, we need to convert the single function into an object that we can move around, because we’ll need to reference its locals directly. Second, we need to separate the wait from all the other setup code.

Here is a wrapper that is disposable, and wraps cleanly around Process.

        public class ProcessExecutor : IDisposable
        {
            public ProcessExecutor(string name, string path)
            {
                _name = name;
                _path = path;
            }

            public void Dispose()
            {
                Close();
            }

            public string Name { get { return _name; } }
            public string StdOut { get { return _stdOut.ToString(); } }
            public string StdErr { get { return _stdErr.ToString(); } }

            // Returns the internal process. Used for getting exit code and other advanced usage.
            // May be proxied by getters. But for now let's trust the consumer.
            public Process Processs { get { return _process; } }

            public bool Run(string args)
            {
                // Make sure we are don't have any old baggage.
                Close();

                // Fresh start.
                _stdOut.Clear();
                _stdErr.Clear();
                _stdOutEvent = new ManualResetEvent(false);
                _stdErrEvent = new ManualResetEvent(false);

                _process = new Process();
                _process.StartInfo = new ProcessStartInfo(_path)
                {
                    Arguments = args,
                    UseShellExecute = false,
                    RedirectStandardOutput = true,
                    RedirectStandardError = true,
                    ErrorDialog = false,
                    CreateNoWindow = true,
                    WorkingDirectory = Path.GetDirectoryName(_path)
                };

                _process.OutputDataReceived += (sender, e) =>
                {
                    _stdOut.AppendLine(e.Data);
                    if (e.Data == null)
                    {
                        var evt = _stdOutEvent;
                        if (evt != null)
                        {
                            lock (evt)
                            {
                                evt.Set();
                            }
                        }
                    }
                };
                _process.ErrorDataReceived += (sender, e) =>
                {
                    _stdErr.AppendLine(e.Data);
                    if (e.Data == null)
                    {
                        var evt = _stdErrEvent;
                        lock (evt)
                        {
                            evt.Set();
                        }
                    }
                };

                _sw = Stopwatch.StartNew();
                _process.Start();
                _process.BeginOutputReadLine();
                _process.BeginErrorReadLine();
                _process.Refresh();
                return true;
            }

            public void Cancel()
            {
                var proc = _process;
                _process = null;
                if (proc != null)
                {
                    // Invalidate cached data to requery.
                    proc.Refresh();

                    // Cancel all pending IO ops.
                    proc.CancelErrorRead();
                    proc.CancelOutputRead();

                    Kill();
                }

                var outEvent = _stdOutEvent;
                _stdOutEvent = null;
                if (outEvent != null)
                {
                    lock (outEvent)
                    {
                        outEvent.Close();
                        outEvent.Dispose();
                    }
                }

                var errEvent = _stdErrEvent;
                _stdErrEvent = null;
                if (errEvent != null)
                {
                    lock (errEvent)
                    {
                        errEvent.Close();
                        errEvent.Dispose();
                    }
                }
            }

            public void Wait()
            {
                Wait(-1);
            }

            public bool Wait(int timeoutMs)
            {
                try
                {
                    if (timeoutMs < 0)
                    {
                        // Wait for process and all I/O to finish.
                        _process.WaitForExit();
                        return true;
                    }

                    // Timed waiting. We need to wait for I/O ourselves.
                    if (!_process.WaitForExit(timeoutMs))
                    {
                        Kill();
                    }

                    // Wait for the I/O to finish.
                    var waitMs = (int)(timeoutMs - _sw.ElapsedMilliseconds);
                    waitMs = Math.Max(waitMs, 10);
                    _stdOutEvent.WaitOne(waitMs);

                    waitMs = (int)(timeoutMs - _sw.ElapsedMilliseconds);
                    waitMs = Math.Max(waitMs, 10);
                    return _stdErrEvent.WaitOne(waitMs);
                }
                finally
                {
                    // Cleanup.
                    Cancel();
                }
            }

            private void Close()
            {
                Cancel();
                var proc = _process;
                _process = null;
                if (proc != null)
                {
                    // Dispose in all cases.
                    proc.Close();
                    proc.Dispose();
                }
            }

            private void Kill()
            {
                try
                {
                    // We need to do this in case of a non-UI proc
                    // or one to be forced to cancel.
                    var proc = _process;
                    if (proc != null && !proc.HasExited)
                    {
                        proc.Kill();
                    }
                }
                catch
                {
                    // Kill will throw when/if the process has already exited.
                }
            }

            private readonly string _name;
            private readonly string _path;
            private readonly StringBuilder _stdOut = new StringBuilder(4 * 1024);
            private readonly StringBuilder _stdErr = new StringBuilder(4 * 1024);

            private ManualResetEvent _stdOutEvent;
            private ManualResetEvent _stdErrEvent;
            private Process _process;
            private Stopwatch _sw;
        }

Converting this to use Tasks is left as an exercise to the reader.

Now, we can use this in a different way.

        public static void Run(List<KeyValuePair<string, string>> pathArgs, int timeout)
        {
            var cts = new CancellationTokenSource();
            var allProcesses = Task.Factory.StartNew(() =>
            {
                var tasks = pathArgs.Select(pair => Task.Factory.StartNew(() =>
                {
                    string name = Path.GetFileNameWithoutExtension(pair.Key);
                    var exec = new ProcessExecutor(name, pair.Key);
                    exec.Run(pair.Value);
                    cts.Token.Register(exec.Cancel);
                    return exec;
                })).ToArray();

                // Wait for individual tasks to finish.
                foreach (var task in tasks)
                {
                    if (task != null)
                    {
                        task.Result.Wait(timeout);
                        task.Result.Cancel();
                        task.Result.Dispose();
                        task.Dispose();
                    }
                }
            });

            // Cancel, if we timed out.
            allProcesses.Wait(timeout);
            cts.Cancel();
        }

This time, we fire each process executor in a separate thread (also a worker on the ThreadPool,) but we wait in a single thread. The worker threads will run, and they will have their async reads queued, which will run once the exec.Run() functions return, and free that thread, at which point the async reads will execute.

Notice that we are creating the new tasks in a worker thread. This is so that we can limit the wait on all of them (although we’d need to have a timeout that depends on their number. But for illustration purpose the above code is sufficient. Now we have the luxury of waiting on all of them in a single Wait call and we can also cancel all of them using the CancellationTokenSource.

The above code is typically finishing in about 5-6 seconds (and as low as 4006ms) for 500 child processes on 12 cores to echo a short string and exit.

One thing to keep in mind when using this code is that if we use ProcessExecutor in another class, as a member, that class should also separate and expose the Run and Wait functions separately. That is, if it simply calls Run followed by Wait in the same function, then they will both execute on the same thread, which will result in the same problem if a number of them get executed on the ThreadPool, as we did. So the abstraction will leak again!

In addition, this puts the onus in the hands of the consumer. Our ProcessExecutor class is not bulletproof on its own.

In the next and final part we’ll go back to the Process class to try and avoid the deadlock issue without assuming a specific usage pattern.

April 5, 2013
Posted by Ashod Nakashian at 6:46 am
1 Response
Code Snippet, From the Trenches, Guides, Programming
Tagged with: .Net, C++, Deadlock, Leaky Abstraction, Parallelization, Pipes, Threading, ThreadPool, Win32
Font Size:
A A A

Async I/O and ThreadPool Deadlock (Part 3)

Apr 042013

In part 2 we discovered that by executing Process instances in Parallel.ForEach we are getting reduced performance and our waits are timing out. In this part we’ll dive deep into the problem and find out what is going on.

Deadlock

Clearly the problem has to do with the Process class and/or with spawning processes in general. Looking more carefully, we notice that besides the spawned process we also have two asynchronous reads. We have intentionally requested these to be executed on separate threads. But that shouldn’t be a problem, unless we have misused the async API.

It is reasonable to suspect that the approach used to do the async reading is at fault. This is where I ventured to look at the Process class code. After all, it’s a wrapper on Win32 API, and it might make assumptions that I was ignoring or contradicting. Regrettably, that didn’t help to figure out what was going on, except for initiating me to write the said previous post.

Looking at the BeginOutputReadLine() function, we see it creating an AsyncStreamReader, which is internal to the .Net Framework and then calls BeginReadLine(), which presumably is where the async action happens.

        [System.Runtime.InteropServices.ComVisible(false)] 
        public void BeginOutputReadLine() {

            if(outputStreamReadMode == StreamReadMode.undefined) { 
                outputStreamReadMode = StreamReadMode.asyncMode;
            } 
            else if (outputStreamReadMode != StreamReadMode.asyncMode) {
                throw new InvalidOperationException(SR.GetString(SR.CantMixSyncAsyncOperation));
            }

            if (pendingOutputRead)
                throw new InvalidOperationException(SR.GetString(SR.PendingAsyncOperation)); 

            pendingOutputRead = true;
            // We can't detect if there's a pending sychronous read, tream also doesn't. 
            if (output == null) {
                if (standardOutput == null) {
                    throw new InvalidOperationException(SR.GetString(SR.CantGetStandardOut));
                } 

                Stream s = standardOutput.BaseStream; 
                output = new AsyncStreamReader(this, s, new UserCallBack(this.OutputReadNotifyUser), standardOutput.CurrentEncoding); 
            }
            output.BeginReadLine(); 
        }

Within the AsyncStreamReader.BeginReadLine() we see the familiar asynchronous read on Stream using Stream.BeginRead().

        // User calls BeginRead to start the asynchronous read
        internal void BeginReadLine() { 
            if( cancelOperation) {
                cancelOperation = false; 
            } 

            if( sb == null ) { 
                sb = new StringBuilder(DefaultBufferSize);
                stream.BeginRead(byteBuffer, 0 , byteBuffer.Length,  new AsyncCallback(ReadBuffer), null);
            }
            else { 
                FlushMessageQueue();
            } 
        }

Unfortunately, I had incorrectly assumed that this async wait was executed on one of the I/O Completion Port threads of the ThreadPool. It seems that this is not the case, as ThreadPool.GetAvailableThreads() always returned the same number for completionPortThreads (incidentally, workerThreads value didn’t change much as well, but I didn’t notice that at first).

A breakthrough came when I started changing the maximum parallelism (i.e. maximum thread count) of Parallel.ForEach.

        public static void ExecAll(List<KeyValuePair<string, string>> pathArgs, int timeout, int maxThreads)
        {
            Parallel.ForEach(pathArgs, new ParallelOptions { MaxDegreeOfParallelism = maxThreads },
                             arg => Task.Factory.StartNew(() => DoJob(arg.Value.Split(' '))).Wait() );
        }

I thought I should increase the maximum number of threads to resolve the issue. Indeed, for certain values of MaxDegreeOfParallelism, I could never reproduce the problem (all processes finished very swiftly, and no timeouts). For everything else, the problem was reproducible most of the time. Nine out of ten I’d get timeouts. However, and to my surprise, the problem went away when I reduced MaxDegreeOfParallelism!

The magic number was 12. Yes, the number of cores at disposal on my dev machine. If we limit the number of concurrent ForEach executions to less than 12, everything finishes swiftly, otherwise, we get timeouts and finishing ExecAll() takes a long time. In fact, with maxThreads=11, 500 process executions finish under 8500ms, which is very commendable. However, with maxThreads=12, every 12 process wait until they timeout, which would take several minutes to finish all 500.

With this information, I tried increasing the ThreadPool limit of threads using ThreadPool.SetMaxThreads(). But it turns out the defaults are 1023 worker threads and 1000 for I/O Completion Port threads, as reported by ThreadPool.GetMaxThreads(). I was assuming that if the available thread count was lower than the required, the ThreadPool would simply create new threads until it reached the maximum configured value.

Diagram showing deadlock (created by www.gliffy.com)

Putting It All Together

The assumption that Parallel.ForEach executes its body on the ThreadPool, assuming said body is a black-box is clearly flawed. In our case the body is initiating asynchronous I/O which needs their own threads. Apparently, these do not come from the I/O thread pool but the worker thread pool. In addition, the number of threads in this pool is initially set to that of the available number of cores on the target machine. Even worse, it will resist creating new threads until absolutely necessary. Unfortunately, in our case it’s too late, as our waits are timing out. What I left until this point (both for dramatic effect and to leave the solution to you, the reader, to find out) is that the timeouts were happening on the StandardOutput and StandardError streams. That is, even though the child processes had exited a long time ago, we were still waiting to read their output.

Let me spell it out, if it’s not obvious: Each call to spawn and wait for a child process is executed on a ThreadPool worker thread, and is using it exclusively until the waits return. The async stream reads on StandardOutput and StandardError need to run on some thread. Since they are apparently queued to run on a ThreadPool thread, they will starve if we use all of the available threads in the pool to wait on them to finish. Thereby timing out on the read waits (because we have a deadlock).

This is a case of Leaky Abstraction, as our black box of a “execute on ThreadPool” failed miserably when the code executed itself depended on the ThreadPool. Specifically, when we had used all available threads in the pool, we left none for our code that depends on the ThreadPool to use. We shot ourselves in the proverbial foot. Our abstraction failed.

In the next part we’ll attempt to solve the problem.

April 4, 2013
Posted by Ashod Nakashian at 6:46 am
2 Responses
Code Snippet, From the Trenches, Guides, Programming
Tagged with: .Net, C++, Deadlock, Leaky Abstraction, Parallelization, Pipes, Threading, ThreadPool, Win32
Font Size:
A A A

Async I/O and ThreadPool Deadlock (Part 2)

Apr 032013

In part 1 we discovered a deadlock in the synchronous approach to reading the output of Process and we solved it using asynchronous reads. Today we’ll parallelize the execution in an attempt to maximize efficiency and concurrency.

Parallel Execution

Now, let’s complicate our lives with some concurrency, shall we?

If we are to spawn many processes, we could (and should) utilize all the cores at our disposal. Thanks to Parallel.For and Parallel.ForEach this task is made much simpler than otherwise.

        public static void ExecAll(List<KeyValuePair<string, string>> pathArgs, int timeout)
        {
            Parallel.ForEach(pathArgs, arg => ExecWithAsyncTasks(arg.Key, arg.Value, timeout));
        }

Things couldn’t be any simpler! We pass a list of executable paths and their arguments as KeyValuePair and a timeout in milliseconds. Except, this won’t work… at least not always.

First, let’s discuss how it will not work, then let’s understand the why before we attempt to fix it.

When Abstraction Backfires

The above code works like a charm in many cases. When it doesn’t, a number of waits timeout. This is unacceptable as we wouldn’t know if we got all the output or part of it, unless we get a clean exit with no timeouts. I first noticed this issue in a completely different way. I was looking at the ~~task manager~~ Process Explorer (if not using it, start now and I promise not to tell anyone,) to see how amazingly faster things are with that single ForEach line. I was expecting to see a dozen or so (on a 12-core machine) child processes spawning and vanishing in quick succession. Instead, and to my chagrin, I saw most of the time just one child! One!

And after many trials and head-scratching and reading, it became clear that the waits were timing out, even though clearly the children had finished and exited. Indeed, because typically a process would run in much less time than the timeout, it was now slower with the parallelized code than with the sequential version. This wasn’t obvious at first, and reasonably I suspected some children were taking too long, or they had too much to write to the output pipes that could be deadlocking (which wasn’t unfamiliar to me).

Testbed

To troubleshoot something as complex as this, one should start with clean test-case, with minimum number of variables. This calls for a dummy child that would do exactly as I told it, so that I could simulate different scenarios. One such scenario would be not to spawn any children at all, and just test the Parallel.ForEach with some in-proc task (i.e. just a local function that does similar work to that of a child).

using System;
using System.Threading;

namespace Child
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 2 || args.Length % 2 != 0)
            {
                Console.WriteLine("Usage: [echo|fill|sleep|return] ");
                return;
            }

            DoJob(args);
        }

        private static void DoJob(string[] args)
        {
            for (int argIdx = 0; argIdx < args.Length; argIdx += 2)
            {
                switch (args[argIdx].ToLowerInvariant())
                {
                    case "echo":
                        Console.WriteLine(args[argIdx + 1]);
                        break;

                    case "fill":
                        var rd = new Random();
                        int bytes = int.Parse(args[argIdx + 1]);
                        while (bytes-- > 0)
                        {
                            // Generate a random string as long as the .
                            Console.Write(rd.Next('a', 'z'));
                        }
                        break;

                    case "sleep":
                        Thread.Sleep(int.Parse(args[argIdx + 1]));
                        break;

                    case "return":
                        Environment.ExitCode = int.Parse(args[argIdx + 1]);
                        break;

                    default:
                        Console.WriteLine("Unknown command [" + args[argIdx] + "]. Skipping.");
                        break;
                }
            }
        }
    }
}

Now we can give the child process commands to change its behavior, from dumping data to its output to sleeping to returning immediately.

Once the problem is reproduced, we can narrow it down to pin-point the source. Running the exact same command in the same process (i. e. without spawning another process) results in no problems at all. Calling DoJob 500 times directly in Parallel.ForEach finishes in under 500ms (often under 450ms). So we can be sure Parallel.ForEach is working fine.

        public static void ExecAll(List<KeyValuePair<string, string>> pathArgs, int timeout)
        {
            Parallel.ForEach(pathArgs, arg => Task.Factory.StartNew(() => DoJob(arg.Value.Split(' '))).Wait() );
        }

Even executing as a new task (within the Parallel.ForEach) doesn’t result in any noticeable different in time. The reason for this good performance when running the jobs in new tasks is probably because the ThreadPool scheduler does fetch the task to execute immediately when we call Wait() and executes it. That is, because both the Task.Factory.StartNew() call as well as the DoJob() call are executed ultimately on the ThreadPool, and because Task is designed specifically to utilize it, when we call Wait() on the task, it knows that it should schedule the next job in the queue, which in this case is the job of the task on which we executed the Wait! Since the caller of Wait() happens to be running on the ThreadPool, it simply executes it instead of scheduling it on a different thread and blocking. Dumping the Thread.CurrentThread.ManagedThreadId from before the Task.Factory.StartNew() call and from within DoJob shows that indeed both are executed in the same thread. The overhead of creating and scheduling a Task is negligible, so we don’t see much of a change in time over 500 executions.

All this is great and comforting, but still doesn’t help us resolve the problem at hand: why aren’t our processes spawned and executed at the highest possible efficiency? And why are they timing out?

In the next part we’ll dive deep into the problem and find out what is going on.

April 3, 2013
Posted by Ashod Nakashian at 6:46 am
1 Response
Code Snippet, From the Trenches, Guides, Programming
Tagged with: .Net, C++, Deadlock, Leaky Abstraction, Parallelization, Pipes, Threading, ThreadPool, Win32
Font Size:
A A A

Async I/O and ThreadPool Deadlock (Part 1)

Apr 022013

I’ve mentioned in a past post that it was conceived while reading the source code for the System.Diagnostics.Process class. This post is about the reason that pushed me to read the source code in an attempt to fix the issue. It turned out that this was yet another case of Leaky Abstraction, which is a special interest of mine.

As it turned out, this post ended being way too long (even for me). I don’t like installments, but I felt that it is something that is worth trying as the size was prohibitive for single-post consumption. As such, I’ve split it up on 5 parts, so that each part would be around a 1000 words or less. I’ll post one part a day.

To give you an idea of the scope and subject of what’s to come, here is a quick overview. In part 1 I’ll lay out the problem. We are trying to spawn processes, read their output and kill if they take too long. Our first attempt is to use simple synchronous I/O to read the output and discover a deadlock. We solve the deadlock using asynchronous I/O. In part 2 we parallelize the code and discover reduced performance and yet another deadlock. We create a testbed and set about to investigate the problem at depth. In part 3 we will find out the root cause and we’ll discuss the mechanics (how and why) we hit such a problem. In part 4 we’ll discuss solutions to the problem and develop a generic solutions (with code) to fix the problem. Finally, in part 5 we see whether or not a generic solution could work before we summarize and conclude.

Let’s begin at the very beginning. Suppose you want to execute some program (call it child), get all its output (and error) and, if it doesn’t exit within some time limit, kill it. Notice that there is no interaction and no input. This is how tests are executed in Phalanger using a test runner.

Synchronous I/O

The Process class has conveniently exposed the underlying pipes to the child process using stream instances StandardOutput and StandardError. And, like many, we too might be tempted to simply call StandardOutput.ReadToEnd() and StandardError.ReadToEnd(). Albeit, that would work, until it doesn’t. As Raymond Chen noted, it’ll work as long as the data fits into the internal pipe buffer. The problem with this approach is that we are asking to read until we reach the end of the data, which will only happen for certainty when the child process we spawned exits. However, when the buffer of the pipe which the child writes its output to is full, the child has to wait until there is free space in the buffer to write to. But, you say, what if we always read and empty the buffer?

Good idea, except, we need to do that for both StandardOutput and StandardError at the same time. In the StandardOutput.ReadToEnd() call we read every byte coming in the buffer until the child process exits. While we have drained the StandardOutput buffer (so that the child process can’t be possibly blocked on that,) if it fills the StandardError buffer, which we aren’t reading yet, we will deadlock. The child won’t exit until it fully writes to the StandardError buffer (which is full because no one is reading it,) meanwhile, we are waiting for the process to exit so we can be sure we read to the end of the StandardOutput before we return (and start reading StandardError). The same problem exists for StandardOutput, if we first read StandardError, hence the need to drain both pipe buffers as they are fed, not one after the other.

Async Reading

The obvious (and only practical) solution is to read both pipes at the same time using separate threads. To that end, there are mainly two approaches. The pre-4.0 approach (async events), and the 4.5-and-up approach (tasks).

Async Reading with Events

The code is reasonably straight forward as it uses .Net events. We have two manual-reset events and two delegates that get called asynchronously when we read a line from each pipe. We get null data when we hit the end of file (i.e. when the process exits) for each of the two pipes.

        public static string ExecWithAsyncEvents(string path, string args, int timeoutMs)
        {
            using (var outputWaitHandle = new ManualResetEvent(false))
            {
                using (var errorWaitHandle = new ManualResetEvent(false))
                {
                    using (var process = new Process())
                    {
                        process.StartInfo = new ProcessStartInfo(path);
                        process.StartInfo.Arguments = args;
                        process.StartInfo.UseShellExecute = false;
                        process.StartInfo.RedirectStandardOutput = true;
                        process.StartInfo.RedirectStandardError = true;
                        process.StartInfo.ErrorDialog = false;
                        process.StartInfo.CreateNoWindow = true;

                        var sb = new StringBuilder(1024);
                        process.OutputDataReceived += (sender, e) =>
                        {
                            sb.AppendLine(e.Data);
                            if (e.Data == null)
                            {
                                outputWaitHandle.Set();
                            }
                        };
                        process.ErrorDataReceived += (sender, e) =>
                        {
                            sb.AppendLine(e.Data);
                            if (e.Data == null)
                            {
                                errorWaitHandle.Set();
                            }
                        };

                        process.Start();
                        process.BeginOutputReadLine();
                        process.BeginErrorReadLine();

                        process.WaitForExit(timeoutMs);
                        outputWaitHandle.WaitOne(timeoutMs);
                        errorWaitHandle.WaitOne(timeoutMs);

                        process.CancelErrorRead();
                        process.CancelOutputRead();

                        return sb.ToString();
                    }
                }
            }
        }

We certainly can improve on the above code (for example we should make the total wait limit <= timeoutMs) but you get the point with this sample. Also, no error handling or killing the child process when it times out and doesn’t exit.

Async Reading with Tasks

A much more simplified and sanitized approach is to use the new System.Threading.Tasks namespace/framework to do all the heavy-lifting for us. As you can see, the code has been cut by half and it’s much more readable, but we need Framework 4.5 and newer for this to work (although my target is 4.0, but for comparison purposes I gave it a spin). The results are the same.

        public static string ExecWithAsyncTasks(string path, string args, int timeout)
        {
            using (var process = new Process())
            {
                process.StartInfo = new ProcessStartInfo(path);
                process.StartInfo.Arguments = args;
                process.StartInfo.UseShellExecute = false;
                process.StartInfo.RedirectStandardOutput = true;
                process.StartInfo.RedirectStandardError = true;
                process.StartInfo.ErrorDialog = false;
                process.StartInfo.CreateNoWindow = true;

                var sb = new StringBuilder(1024);

                process.Start();
                var stdOutTask = process.StandardOutput.ReadToEndAsync();
                var stdErrTask = process.StandardError.ReadToEndAsync();

                process.WaitForExit(timeout);
                stdOutTask.Wait(timeout);
                stdErrTask.Wait(timeout);

                return sb.ToString();
            }
        }

Again, a healthy doze of error-handling is in order, but for illustration purposes left out. A point worthy of mention is that we can’t assume we read the streams by the time the child exits. There is a race condition and we still need to wait for the I/O operations to finish before we can read the results.

In the next part we’ll parallelize the execution in an attempt to maximize efficiency and concurrency.

April 2, 2013
Posted by Ashod Nakashian at 6:46 am
No Responses
Code Snippet, From the Trenches, Guides, Programming
Tagged with: .Net, C++, Deadlock, Leaky Abstraction, Parallelization, Pipes, Threading, ThreadPool, Win32
Font Size:
A A A

Thread Abort and Critical Regions (in .Net)

Mar 182013

I don’t need to see the source code of an API to code against. In fact, I actively discourage against depending (even psychologically) on the inner details of an implementation. The contract should be sufficient. Of course I’m assuming a well-designed API with good (at least decent) documentation. But ~~sometimes~~ often reality is more complicated than an ideal world.

Empty try{}

While working with System.Diagnostics.Process in the context of Parallel.ForEach things became a bit too complicated. (I’ll leave the gory details to another post.) What prompted this post was a weird pattern that I noticed while browsing Process.cs, the source code for the Process class (to untangle said complicated scenario).

RuntimeHelpers.PrepareConstrainedRegions();
try {} finally {
   retVal = NativeMethods.CreateProcessWithLogonW(
		   startInfo.UserName,
		   startInfo.Domain,
		   password,
		   logonFlags,
		   null,            // we don't need this since all the info is in commandLine
		   commandLine,
		   creationFlags,
		   environmentPtr,
		   workingDirectory,
		   startupInfo,        // pointer to STARTUPINFO
		   processInfo         // pointer to PROCESS_INFORMATION
	   );
   if (!retVal)
	  errorCode = Marshal.GetLastWin32Error();
   if ( processInfo.hProcess!= (IntPtr)0 && processInfo.hProcess!= (IntPtr)NativeMethods.INVALID_HANDLE_VALUE)
	  procSH.InitialSetHandle(processInfo.hProcess);
   if ( processInfo.hThread != (IntPtr)0 && processInfo.hThread != (IntPtr)NativeMethods.INVALID_HANDLE_VALUE)
	  threadSH.InitialSetHandle(processInfo.hThread);
}

This is from StartWithCreateProcess(), a private method of Process. So no surprises that it’s doing a bunch of Win32 native API calls. What stands out is the try{} construct. But also notice the RuntimeHelpers.PrepareConstrainedRegions() call.

Thinking of possible reasons for this, I suspected it had to do with run-time guarantees. The RuntimeHelpers.PrepareConstrainedRegions() call is a member of CER. So why the need to use empty try if we have the PrepareConstrainedRegions call? Regrettably, I confused it with the empty try clause. In reality, the empty try construct has nothing to do with CER and everything with execution interruption by means of ThreadAbortException.

A quick search hit Siddharth Uppal’s The empty try block mystery where he explains that Thread.Abort() never interrupts code in the finally clause. Well, not quite.

Thread.Abort and finally

A common interview question, after going through the semantics of try/catch/finally, is to ask the candidate if finally is always executed (since usually that’s the wording they use). Are there no scenarios where one could conceivably expect their code in finally never to get executed? A creative response is usually when we have an an “infinite loop” in the try or catch (if it gets called). Novice candidates are easily confused when they consider a premature termination of the process (or thread). After all, one would expect that there be some consistency in the behavior of code. So why shouldn’t finally always get executed, even in a process termination scenario?

It’s not difficult to see that there is a struggle of powers between the termination party and the code/process in question. If finally always executes, there would be no guarantees for termination. Yes, we cannot guarantee both that finally will always execute while guaranteeing that termination will always succeed. One or both must have weak guarantees (or at least weaker guarantees than the other). When push comes to shove, we (as users or administrators) want to have full control over our machines, so we choose to have the ultimate magic wand to kill and terminate any misbehaving (or just undesirable) process.

The story is a little bit different when it comes to aborting individual threads, however. Where on the process level the operating system can terminate it with a sweeping gesture, in the managed world things are more controlled. The CLR can see to it that any thread that is about to get terminated (by a call to Thread.Abort()) is done cleanly and respecting all the language and runtime rules. This includes executing finally blocks as well as finalizers.

ThreadAbortException weirdness

When aborting a thread, the apparent behavior is one of an exception thrown from within the thread in question. When another thread invokes Thread.Abort() on our thread, a ThreadAbortException is raised from our thread code. This is called asynchronous thread abort, as opposed to synchronous abort, when a thread invokes Thread.CurrentThread.Abort() (invariantly on itself). Other asynchronous exceptions include OutOfMemoryException and StackOverflowException.

The behavior, then, is exactly as one would expect when an exception is raised. The exception bubbles up the stack, executing catch and finally blocks as one would expect from any other exception. There are, however, a couple of crucial differences between ThreadAbortException and other Exceptions (with the exception of StackOverflowException, which can’t be caught at all). First, this exception can’t be suppressed by simply catching it – it is automatically rethrown right after exiting the catch clause that caught it. Second, throwing it does not abort the running thread (it must be done via a call to Thread.Abort()).

The source for this behavior of ThreadAbortException is the abort requested flag, which is set when Thread.Abort() is invoked (but not when it is thrown directly). CLR then checks for this flag at certain check-points and proceeds to raise the exception, which normally is raised between any two machine instructions. This guarantees that the exception will not get thrown when executing a finally block or when executing unmanaged code.

So the expectation of the novice interviewee (and Mr. Uppal’s) was right after all. Except, it wasn’t. We are back full circle to the problem between the purpose of aborting a thread, and the possibility of an ill behaved code never giving up at all. I am being too generous when I label code that wouldn’t yield to a request to abort as “ill behaved.” Because ThreadAbortException is automatically rethrown from catch causes, the only way to suppress it is to explicitly call Thread.ResetAbort() which clears the abort requested flag. This is intentional as developers are in the habit of writing catching-all clauses very frequently.

AppDomain hosting

So far we’ve assumed that we just might need to terminate a process, no questions asked. But why would one need to abort individual threads within a process? The answer lies with hosting. In environments such as IIS or SQL servers, the server should be both fast and reliable. This led to the design of compartmentalizing processes beyond threads. AppDomain groups processing units within a single process such that spawning new instances is fast (faster than spawning a complete new process,) but at the same time it’s grouped such that they can be unloaded on demand. When an AppDomain instance takes longer than the configured time (or consumes some resource more than it should,) it’s deemed ill-behaved and the server will want to unload it. Unloading includes aborting every thread within the AppDomain in question.

The problem is yet again one of conflict between guarantees. This time, though, the termination logic needs to play along, or else. When terminating a process, the complete process memory is released, along with all its system resources. If the managed code or CLR don’t do that, the operating system will. In a hosted execution environment, the host wants to have full control over the life-time of an AppDomain (with all its threads,) all the while, when it decides to purge of it, it does not want to destabilize the process or, worse, itself or the system at large. When unloading an AppDomain, the server wants to give it a chance to cleanup and release any shared resources, including files and sockets and synchronization objects (i.e. locks,) to name but a few. This is because the process will continue running, hopefully for a very long time. Hence the behavior of ThreadAbortException that calls every catch and finally as it should.

In return, any process that wants to play rough gets to call Thread.ResetAbort() and go on with its life, thereby defeating the control that the server enjoyed. The server invariantly has the upper hand, of course. After a second limit is exceeded, after invoking Thread.Abort(), in the words of Tarantino, the server may “go medieval” on the misbehaving AppDomain.

Rude Abort/Unload/Exit

When a thread requested to abort doesn’t play along, it warrants rudeness. The CLR allows a host to specify escalation policy in similar events, such that the host would escalate a normal thread abort into a rude thread abort. Similarly, a normal AppDomain unload and process exit may be escalated to a rude ones.

But we all know that the server doesn’t want to be too inconsiderate. It wouldn’t want to jeopardize its stability in the wake of this arms race between it and the hosted AppDomain. For that, it wants to have some guarantees. More stringent guarantees from the code in question that it will not misbehave again, when given half a chance. One such guarantee is that the finalization code will not have ill side-effects.

In a rude thread abort event, the CLR forgoes calling any finally blocks (except those marked as Constrained Execution Regions, or CER for short) as well as any normal finalizer. But unlike mere finally blocks, finalizers are a much more serious bunch. They are functions with consequences. Recall that finalization serves the purpose of releasing system resources. In a completely managed environment, with garbage collection and cleanup, the only resources that needs special care are those that aren’t managed. In an ideal scenario, one wouldn’t need to implement finalizers at all. The case is different when we need to wrap a system resource that is not managed (this includes native DLL invoking). All system resources that are represented by the framework are released precisely using a finalizer.

Admittedly, if we are developing standalone applications, as opposed to hosted, we don’t have to worry about the possibility of escalation and rude abort or unload. But then again, why should we worry about Thread.Abort() at all in such a case? Either our code could issue such a harsh request, which we should avoid like the plague and opt to more civil cancellation methods, such as raising events or setting shared flags (with some special care), or, our code is in a library that may be called either from a standalone application or a hosted one. Only in the latter case must we worry and prepare for the possibility of rude abort/unload.

Critical Finalization and CER

Dispose() is called in finally blocks, either manually us via the using clause. So the only correct way to dispose objects during such an upheaval is to have finalizers on these objects. And not just any finalizer, but Critical Finalizers. This is the same technique used in SafeHandle to ensure that native handles are correctly released in the event of a catastrophic failure.

When things get serious, only finalizers marked as safe are called. Unsurprisingly, attributes are used to mark methods as safe. The contract between CLR and the code is a strict one. First, we communicate how critical a function is, in the face of asynchronous exceptions by marking their reliability. Next, we void our rights to allocate memory, which isn’t trivial, since this is done transparently in some cases such as P/Invoke marshaling, locking and boxing. In addition, the only methods we can call from within a CER block are those with strong reliability guarantees. Virtual methods that aren’t prepared in advance cannot be called either.

This brings us full circle to RuntimeHelpers.PrepareConstrainedRegions(). What this call does is it tells the CLR to fully prepare the proceeding code, by allocating all necessary memory, ensuring there is sufficient stack space, JITing the code, which completely loads any assemblies we may need.

Here is a sample code that demonstrates how this works in practice. When the ReliabilityContract attribute is commented out, the try block is executed before the finally block, which fails. However, with the ReliabilityContract attribute, the PrepareConstrainedRegions() call fails to allocate all necessary memory beforehand and therefore doesn’t even attempt to execute the try clause, nor the finally, instead the exception is thrown immediately.

There are three forms to execute code in Constrained Execution Regions (from the BCL Team Blog):

ExecuteCodeWithGuaranteedCleanup, a stack-overflow safe form of a try/finally.
A try/finally block preceded immediately by a call to RuntimeHelpers.PrepareConstrainedRegions. The try block is not constrained, but all catch, finally, and fault blocks for that try are.
As a critical finalizer – any subclass of CriticalFinalizerObject has a finalizer that is eagerly prepared before an instance of the object is allocated.
- A special case is SafeHandle’s ReleaseHandle method, a virtual method that is eagerly prepared before the subclass is allocated, and called from SafeHandle’s critical finalizer.

Conclusion

Normally, CLR guarantees clean execution and cleanup (via finally and finalization code) even in the face of asynchronous exceptions and thread abort. However it does preserve the right to take a harsher measure if the host escalates things. Writing code in finally blocks to avoid dealing with the possibility of asynchronous exceptions, while not the best practice, will work. When we abuse this, by reseting abort requests and spending too long in finally blocks, the host will escalate things to rude unload and will aggressively rip the AppDomain with all its threads, bypassing finally blocks, unless in a CER block, as the above code did.

So, finally blocks are not executed when a rude abort/unload is in progress (unless in CER), when the process is terminated (by the operating system), when an unhandled exception is raised (typically in unmanaged code or in the CLR) or in background threads (IsBackground == true) when all foreground threads have exited.

March 18, 2013
Posted by Ashod Nakashian at 1:41 am
4 Responses
Best Practice, Guides, Programming
Tagged with: .Net, Async Exception, Constrained Execution Region, Exception Handling, Thread Abort
Font Size:
A A A

HttpWebRequest: When 64-bits is too much

Apr 172011

With the introduction of .Net and a new, modern framework library, developers understandably were very cheerful. A shiny new ecosystem with mint library designed without any backwards compatibility curses or legacy support. Finally, a library to take us into the future without second guessing. Well, those were the hopes and dreams of the often too-optimistic and naive.

However, if you’d grant me those simplistic titles, you’d understand my extreme disappointment when the compiler barfed on my AddRange call on HttpWebRequest with a long parameter. Apparently HttpWebRequest lacks 64-bit AddRange member.

Surely this was a small mistake, a relic from the .Net 1.0 era. Nope, it’s in 2.0 as well. Right then, I should be used 3.5. What’s wrong with me using 2.0, such an outdated version. Wrong again, I am using 3.5. But I need to resume downloads on 7GB+ files!

To me, this is truly a shocking goof. After all, .Net is supposed to be all about the agility and modernity that is the web. Three major releases of the framework and no one put a high-priority tag on this missing member? Surely my panic was exaggerated. It must be. There is certainly some simple workaround that everyone is using that makes this issue really a low-priority one.

Right then, the HttpWebRequest is a WebRequest, and I really don’t need a specialized function to set an HTTP header. Let’s set the header directly:

            HttpWebRequest request = WebRequest.Create(Uri) as HttpWebRequest;

            request.Headers["Range"] = "bytes=0-100";

To which, .Net responded with the following System.ArgumentException:

This header must be modified using the appropriate property.

Frustration! Luckily, somebody ultimately took notice of this glaring omission and added the AddRange(long, long); function to .Net 4.0.

So where does this leave us? Seems that I either have to move to .Net 4.0, write my own HttpWebRequest replacement or avoid large files altogether. Unless, that is, I find a hack.

Different solutions do exist to this problem on the web, but the most elegant one was this:

        /// <summary>
        /// Sets an inclusive range of bytes to download.
        /// </summary>
        /// <param name="request">The request to set the range to.</param>
        /// <param name="from">The first byte offset, -ve to omit.</param>
        /// <param name="to">The last byte offset, less-than from to omit.</param>
        private static void SetWebRequestRange(HttpWebRequest request, long from, long to)
        {
            string first = from >= 0 ? from.ToString() : string.Empty;
            string last = to >= from ? to.ToString() : string.Empty;

            string val = string.Format("bytes={0}-{1}", first, last);

            Type type = typeof(WebHeaderCollection);
            MethodInfo method = type.GetMethod("AddWithoutValidate", BindingFlags.Instance | BindingFlags.NonPublic);
            method.Invoke(request.Headers, new object[] { "Range", val });
        }

Since there were apparently copied pages with similar solutions, I’m a bit hesitant to give credit to any particular page or author in fear of giving credit to a plagiarizer. In return, I’ve improved the technique and put it into a flexible function. In addition, I’ve wrapped WebResponse into a reusable Stream class that plays better with non-network streams. In particular, my WebStream supports reading the Length and Position members and returns the correct results. Here is the full source code:

// --------------------------------------------------------------------------------------
// <copyright file="WebStream.cs" company="Ashod Nakashian">
// Copyright (c) 2011, Ashod Nakashian
// All rights reserved.
// 
// Redistribution and use in source and binary forms, with or without modification,
// are permitted provided that the following conditions are met:
// 
// o Redistributions of source code must retain the above copyright notice, 
// this list of conditions and the following disclaimer.
// o Redistributions in binary form must reproduce the above copyright notice, 
// this list of conditions and the following disclaimer in the documentation and/or
// other materials provided with the distribution.
// o Neither the name of the author nor the names of its contributors may be used to endorse
// or promote products derived from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
// OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT 
// SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 
// INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
// </copyright>
// <summary>
//   Wraps HttpWebRequest and WebResponse instances as Streams.
// </summary>
// --------------------------------------------------------------------------------------

namespace Web
{
    using System;
    using System.IO;
    using System.Net;
    using System.Reflection;

    /// <summary>
    /// HTTP Stream, wraps around HttpWebRequest.
    /// </summary>
    public class WebStream : Stream
	{
		public WebStream(string uri)
            : this(uri, 0)
		{
		}

        public WebStream(string uri, long position)
		{
			Uri = uri;
			position_ = position;
		}

        #region properties

        public string Uri { get; protected set; }
        public string UserAgent { get; set; }
        public string Referer { get; set; }

        #endregion // properties

        #region Overrides of Stream

        /// <summary>
        /// When overridden in a derived class, clears all buffers for this stream and causes any buffered data to be written to the underlying device.
        /// </summary>
        /// <filterpriority>2</filterpriority>
        public override void Flush()
        {
        }

        /// <summary>
        /// When overridden in a derived class, sets the position within the current stream.
        /// </summary>
        /// <returns>
        /// The new position within the current stream.
        /// </returns>
        /// <param name="offset">A byte offset relative to the <paramref name="origin"/> parameter.</param>
        /// <param name="origin">A value of type <see cref="T:System.IO.SeekOrigin"/> indicating the reference point used to obtain the new position.</param>
        /// <filterpriority>1</filterpriority>
        /// <exception cref="NotImplementedException"><c>NotImplementedException</c>.</exception>
        public override long Seek(long offset, SeekOrigin origin)
        {
            throw new NotImplementedException();
        }

        /// <summary>
        /// When overridden in a derived class, sets the length of the current stream.
        /// </summary>
        /// <param name="value">The desired length of the current stream in bytes.</param>
        /// <filterpriority>2</filterpriority>
        /// <exception cref="NotImplementedException"><c>NotImplementedException</c>.</exception>
        public override void SetLength(long value)
        {
            throw new NotImplementedException();
        }

        /// <summary>
        /// When overridden in a derived class, reads a sequence of bytes from the current stream and advances the position within the stream by the number of bytes read.
        /// </summary>
        /// <returns>
        /// The total number of bytes read into the buffer. This can be less than the number of bytes requested if that many bytes are not currently available, or zero (0) if the end of the stream has been reached.
        /// </returns>
        /// <param name="buffer">An array of bytes. When this method returns, the buffer contains the specified byte array with the values between <paramref name="offset"/> and (<paramref name="offset"/> + <paramref name="count"/> - 1) replaced by the bytes read from the current source.</param>
        /// <param name="offset">The zero-based byte offset in <paramref name="buffer"/> at which to begin storing the data read from the current stream.</param>
        /// <param name="count">The maximum number of bytes to be read from the current stream.</param>
        /// <filterpriority>1</filterpriority>
        /// <exception cref="System.ArgumentException">The sum of offset and count is larger than the buffer length.</exception>
        /// <exception cref="System.ArgumentNullException">buffer is null.</exception>
        /// <exception cref="System.ArgumentOutOfRangeException">offset or count is negative.</exception>
        /// <exception cref="System.NotSupportedException">The stream does not support reading.</exception>
        /// <exception cref="System.ObjectDisposedException">Methods were called after the stream was closed.</exception>
		public override int Read(byte[] buffer, int offset, int count)
		{
            if (stream_ == null)
            {
                Connect();
            }

            try
            {
                if (stream_ != null)
                {
                    int read = stream_.Read(buffer, offset, count);
                    position_ += read;
                    return read;
                }
            }
            catch (WebException)
            {
                Close();
            }
            catch (IOException)
            {
                Close();
            }

            return -1;
		}

        /// <summary>
        /// When overridden in a derived class, writes a sequence of bytes to the current stream and advances the current position within this stream by the number of bytes written.
        /// </summary>
        /// <param name="buffer">An array of bytes. This method copies <paramref name="count"/> bytes from <paramref name="buffer"/> to the current stream.</param>
        /// <param name="offset">The zero-based byte offset in <paramref name="buffer"/> at which to begin copying bytes to the current stream.</param>
        /// <param name="count">The number of bytes to be written to the current stream.</param>
        /// <filterpriority>1</filterpriority>
        /// <exception cref="NotImplementedException"><c>NotImplementedException</c>.</exception>
        public override void Write(byte[] buffer, int offset, int count)
        {
            throw new NotImplementedException();
        }

        /// <summary>
        /// When overridden in a derived class, gets a value indicating whether the current stream supports reading.
        /// Always returns true.
        /// </summary>
        /// <returns>
        /// true if the stream supports reading; otherwise, false.
        /// </returns>
        /// <filterpriority>1</filterpriority>
        public override bool CanRead
        {
            get { return true; }
        }

        /// <summary>
        /// When overridden in a derived class, gets a value indicating whether the current stream supports seeking.
        /// Always returns false.
        /// </summary>
        /// <returns>
        /// true if the stream supports seeking; otherwise, false.
        /// </returns>
        /// <filterpriority>1</filterpriority>
        public override bool CanSeek
        {
			get { return false; }
        }

        /// <summary>
        /// When overridden in a derived class, gets a value indicating whether the current stream supports writing.
        /// Always returns false.
        /// </summary>
        /// <returns>
        /// true if the stream supports writing; otherwise, false.
        /// </returns>
        /// <filterpriority>1</filterpriority>
        public override bool CanWrite
        {
			get { return false; }
        }

        /// <summary>
        /// When overridden in a derived class, gets the length in bytes of the stream.
        /// </summary>
        /// <returns>
        /// A long value representing the length of the stream in bytes.
        /// </returns>
        /// <exception cref="T:System.ObjectDisposedException">Methods were called after the stream was closed.</exception>
        /// <filterpriority>1</filterpriority>
        public override long Length
        {
            get { return webResponse_.ContentLength; }
        }

        /// <summary>
        /// When overridden in a derived class, gets or sets the position within the current stream.
        /// </summary>
        /// <returns>
        /// The current position within the stream.
        /// </returns>
        /// <filterpriority>1</filterpriority>
        /// <exception cref="NotSupportedException"><c>NotSupportedException</c>.</exception>
        public override long Position
        {
			get { return position_; }
			set { throw new NotSupportedException(); }
        }

        #endregion // Overrides of Stream

        #region operations

        /// <summary>
        /// Reads the full string data at the given URI.
        /// </summary>
        /// <returns>The full contents of the given URI.</returns>
        public static string ReadToEnd(string uri, string userAgent, string referer)
        {
            using (WebStream ws = new WebStream(uri, 0))
            {
                ws.UserAgent = userAgent;
                ws.Referer = referer;
                ws.Connect();

                using (StreamReader reader = new StreamReader(ws.stream_))
                {
                    return reader.ReadToEnd();
                }
            }
        }

        /// <summary>
        /// Writes the full data at the given URI to the given stream.
        /// </summary>
        /// <returns>The number of bytes written.</returns>
        public static long WriteToStream(string uri, string userAgent, string referer, Stream stream)
        {
            using (WebStream ws = new WebStream(uri, 0))
            {
                ws.UserAgent = userAgent;
                ws.Referer = referer;
                ws.Connect();

                long total = 0;
                byte[] buffer = new byte[64 * 1024];
                int read;
                while ((read = ws.stream_.Read(buffer, 0, buffer.Length)) > 0)
                {
                    stream.Write(buffer, 0, read);
                    total += read;
                }

                return total;
            }
        }

        #endregion // operations

        #region implementation

        protected override void Dispose(bool disposing)
        {
            base.Dispose(disposing);

            if (stream_ != null)
            {
                stream_.Dispose();
                stream_ = null;
            }
        }

        private void Connect()
        {
            Close();

            HttpWebRequest request = WebRequest.Create(Uri) as HttpWebRequest;
            if (request == null)
            {
                return;
            }

            request.UserAgent = UserAgent;
            request.Referer = Referer;
            if (position_ > 0)
            {
                SetWebRequestRange(request, position_, 0);
            }

            webResponse_ = request.GetResponse();
            stream_ = webResponse_.GetResponseStream();
        }

        /// <summary>
        /// Sets an inclusive range of bytes to download.
        /// </summary>
        /// <param name="request">The request to set the range to.</param>
        /// <param name="from">The first byte offset, -ve to omit.</param>
        /// <param name="to">The last byte offset, less-than from to omit.</param>
        private static void SetWebRequestRange(HttpWebRequest request, long from, long to)
        {
            string first = from >= 0 ? from.ToString() : string.Empty;
            string last = to >= from ? to.ToString() : string.Empty;

            string val = string.Format("bytes={0}-{1}", first, last);

            Type type = typeof(WebHeaderCollection);
            MethodInfo method = type.GetMethod("AddWithoutValidate", BindingFlags.Instance | BindingFlags.NonPublic);
            method.Invoke(request.Headers, new object[] { "Range", val });
        }

        #endregion // implementation

        #region representation

        private long position_;
        private WebResponse webResponse_;
        private Stream stream_;

        #endregion // representation
	}
}

I hope this saves someone some frustration and perhaps even time writing this handy class. Enjoy.