But we're not that interested in that particular scenario - we're more interested in thinking about the symbols themselves. So what we can do to simply things is just set m=1 - the system emits one symbol per second, so then we just have H' = H (numerically speaking; of course the units are different, but the meaning is quite similar).
To understand the proof of the theorem, we relied on the simple example in the paper (the symbols A, B, C, D, with some probabilities attached to them). This example is actually too simple to show the role of N, but it's still possible to explain it, I think, without having to look too closely at a more complicated example.
Here's the idea. Consider English (which we've mentioned a few times in class, and which Shannon spends a couple pages on - you may want to look at his techniques in the paper for constructing "statistical approximations" to English - it's both interesting and relevant to this discussion). Each letter in the alphabet has a certain probability of occurring, as we've noted. E has a relatively high probability; Q has a relatively low probability. So far this sounds very much like the simple A-B-C-D example we looked at (except with more symbols). But we can actually say more complicated things about English. What if we consider pairs of letters? The probability of seeing an H is (let's suppose) 0.05. But the probability of seeing an H immediately following a T is probably a bit higher, maybe 0.07; and the probability of seeing an H immediately following a Q is much lower, almost 0. This level of detail can't be captured by looking only at single-letter probabilities, but if we look at pairs we can get that kind of information. So we could make (as people already have) a table of all 26x26 pairs of letters and assign them probabilities - pairs like EE, TH, IO would have relatively high probabilities, and pairs like JQ, PZ, XY would have almost zero probability. We could extend this to triples of letters - even though TH is a relatively common pair, the triple ETH is far more likely than VTH. And so on.
The point is that when you have a "real" example, just looking at messages that are one letter long isn't going to let you capture the full statistical profile of the source, but by increasing N, we get closer and closer to the real measure - we're able to be increasingly clever in how we encode the symbols into bits.
I think that covers most of what we were struggling with; if you're still feeling confused about any of this stuff, email me, and I'll try to provide a response either here or privately.