Probability
Oxford notes for Part A probability.
Buzzwords
Probability space $(\Omega, \mathcal{F}, \mathbb{P})$ = modelling an experiment. $\Omega$ is the space of outcomes, $\mathcal{F}$ is the space of events, $\mathbb{P}$ gives a measure of probability.
Memoryless property of geometric and exponential distributions = what's past is past.
Close connections between the Poisson distribution and the exponential distribution: think Poisson processes.
Convergence almost surely (with probability 1) implies convergence in probability, which implies convergence in distribution.
WLLN
The weak law of large numbers is a limit in probability. Proved using Chebyshev's inequality (requires finite variance).
This assumption of finite variance is not necessary (it just makes the proof easier). Large or infinite variance makes the convergence slower, but the LLN holds anyway.
Strong Law of Large Numbers (SLLN) is almost sure convergence.
CLT: convergence in distribution. The fluctuations of the mean $S_n/n$ around $\mu$ are of order $1/\sqrt{n}$.
Risk pooling (uses the CLT): $S_n = X_1 + \cdots + X_n$, where the $X_i$ are different portfolios (i.i.d. assumption!).
Binomial with small probability and many events is Poisson.
Binomial = sum of Bernoullis. If $p$ is fixed, then by the CLT the Binomial$(n, p)$ converges to a Normal distribution. Consider instead $p = p_n$ converging to 0, and assume the expected total number of successes stays approximately constant: $np_n \to \lambda$. Then Binomial$(n, p_n)$ converges to Poisson$(\lambda)$.
Generating functions are used to compute moments and pmfs. Take the $k$-th derivative and evaluate at 0 or 1.
Pgfs are used for discrete random variables: $G_X(s) = \mathbb{E}[s^X]$.
Mgfs are used for continuous (and general) random variables: $M_X(t) = \mathbb{E}[e^{tX}]$.
Mgfs and pgfs uniquely determine the distribution of a random variable.
For random variables $X, X_1, X_2, \dots$: if $M_{X_n}(t) \to M_X(t)$ for all $t$, then $X_n \to X$ in distribution as $n \to \infty$.
Proofs of the WLLN and CLT use mgfs and the above fact.
Chebyshev's inequality, which is the application of Markov's inequality to the random variable $(X - \mathbb{E}X)^2$.
Markov's inequality is about bounding the tails of non-negative random variables: $\mathbb{P}(X \ge a) \le \mathbb{E}X / a$.
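In one display (a standard one-line check, included for completeness):

```latex
% Markov's inequality applied to the non-negative variable (X - EX)^2:
\mathbb{P}\bigl(|X - \mathbb{E}X| \ge a\bigr)
  = \mathbb{P}\bigl((X - \mathbb{E}X)^2 \ge a^2\bigr)
  \le \frac{\mathbb{E}\bigl[(X - \mathbb{E}X)^2\bigr]}{a^2}
  = \frac{\operatorname{Var}(X)}{a^2}.
```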
Fact: for a standard normal random variable $Z$, $M_Z(t) = e^{t^2/2}$.
Markov chain is a sequence of variables $X_0, X_1, X_2, \dots$ satisfying the Markov property
$\mathbb{P}(X_{n+1} = x_{n+1} \mid X_0 = x_0, \dots, X_n = x_n) = \mathbb{P}(X_{n+1} = x_{n+1} \mid X_n = x_n)$.
To describe a time homogeneous Markov chain you need the initial distribution of $X_0$ and a transition matrix $P = (p_{ij})$.
The matrix is indexed by the state space $I$. $P$ has non-negative entries and each row sums to 1.
Markov chains are memoryless.
Chapman-Kolmogorov equations
irreducible chains, communicating classes, closed class, periodicity of classes
recurrence, transience, mean return time, null recurrence, positive recurrence
null recurrent = mean return time is infinite, but the probability of going back infinitely many times is 1.
In a communicating class all states are either positive recurrent, or null recurrent, or transient.
Random walk on $\mathbb{Z}^d$ is an irreducible chain with period 2.
An i.i.d. sequence is a special case of a Markov chain.
The ergodic theorem is a generalisation of the SLLN.
Chapter 1. Recap Prelims
- (i) $\Omega$ is a set, which we call the sample space.
- (ii) $\mathcal{F}$ is a collection of subsets of $\Omega$. An element of $\mathcal{F}$ is called an event.
- (iii) The probability measure $\mathbb{P}$ is a function from $\mathcal{F}$ to $[0, 1]$. It assigns a probability to each event in $\mathcal{F}$.
A random variable $X$ is a function defined on $\Omega$. It maps outcomes to numbers.
cdf, pdf, discrete and cts random variables, variance, covariance
Independence: $\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)$. Pairwise independence is weaker than (mutual) independence.
If $X$ and $Y$ are independent, then the density of $X + Y$ is the convolution $f_{X+Y}(z) = \int f_X(x)\, f_Y(z - x)\, dx$. As a special case, if $X_1, \dots, X_n$ are i.i.d. with $\mathrm{Exp}(\lambda)$ distribution, then $X_1 + \cdots + X_n$ has $\mathrm{Gamma}(n, \lambda)$ distribution.
Memoryless property of geometric and exponential distributions: $\mathbb{P}(X > s + t \mid X > s) = \mathbb{P}(X > t)$.
If $X \sim \mathrm{Exp}(\lambda)$ and $Y \sim \mathrm{Exp}(\mu)$ are independent, then $\min(X, Y) \sim \mathrm{Exp}(\lambda + \mu)$.
Chapter 2. Convergence of random variables and limit theorems.
We need to formalise the notion that two random variables $X$ and $Y$ are close.
Modes of convergence
Let $X$ and $X_1, X_2, \dots$ be random variables. We say:
$X_n$ converges to $X$ almost surely (with probability 1) if $\mathbb{P}(X_n \to X \text{ as } n \to \infty) = 1$.
$X_n$ converges in probability to $X$ if for every $\varepsilon > 0$: $\mathbb{P}(|X_n - X| > \varepsilon) \to 0$ as $n \to \infty$.
$X_n$ converges to $X$ in distribution (weakly) if $F_{X_n}(x) \to F_X(x)$ as $n \to \infty$ for all $x$ at which $F_X$ is continuous.
These formulations are in decreasing order of strength. The last one does not even require the random variables to be defined on the same probability space! It is really a definition about convergence of distributions, not about convergence of random variables.
There are many situations in which a sequence of discrete random variables converges to a continuous limit.
WLLN is convergence in probability.
Let $S_n = X_1 + \cdots + X_n$, where the $X_i$ are i.i.d. with mean $\mu$ and finite variance. Then:
$S_n / n \to \mu$ in probability as $n \to \infty$. Prove it using Chebyshev's inequality.
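The whole proof in one display, assuming $\operatorname{Var}(X_i) = \sigma^2 < \infty$ (this is the standard Chebyshev argument):

```latex
% S_n/n has mean mu and variance sigma^2/n, so by Chebyshev:
\mathbb{P}\Bigl(\Bigl|\frac{S_n}{n} - \mu\Bigr| > \varepsilon\Bigr)
  \le \frac{\operatorname{Var}(S_n/n)}{\varepsilon^2}
  = \frac{\sigma^2}{n \varepsilon^2}
  \longrightarrow 0 \quad \text{as } n \to \infty.
```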
CLT: convergence in distribution (to the standard normal): $\frac{S_n - n\mu}{\sigma\sqrt{n}} \to N(0, 1)$ in distribution. The fluctuations of $S_n/n$ around $\mu$ are of order $1/\sqrt{n}$.
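A quick simulation sketch of both laws (my own illustration, not from the notes; the choice of $\mathrm{Exp}(1)$ samples and the sample sizes are arbitrary): sample means concentrate at $\mu$ (WLLN), while the $\sqrt{n}$-rescaled fluctuations keep a spread of order 1 (CLT).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0  # mean and standard deviation of Exp(1)

for n in [10, 100, 1_000]:
    samples = rng.exponential(scale=1.0, size=(5_000, n))
    means = samples.mean(axis=1)
    # WLLN: the probability of a deviation of fixed size goes to 0
    print(f"n={n:5d}  P(|S_n/n - mu| > 0.1) ~ {(np.abs(means - mu) > 0.1).mean():.3f}")
    # CLT: sqrt(n)(S_n/n - mu)/sigma has spread ~ N(0,1)
    z = np.sqrt(n) * (means - mu) / sigma
    print(f"         std of rescaled fluctuations ~ {z.std():.3f} (expect ~1)")
```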
Binomial with small probability and many events is Poisson.
Binomial = sum of Bernoullis. If $p$ is fixed, then by the CLT the Binomial$(n, p)$ converges to a Normal distribution. Consider instead $p = p_n$ converging to 0, and assume the expected total number of successes stays approximately constant: $np_n \to \lambda$. Then Binomial$(n, p_n)$ converges to Poisson$(\lambda)$.
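The computation behind the Poisson limit, taking $p_n = \lambda/n$ (standard, sketched here for fixed $k$):

```latex
\binom{n}{k} \Bigl(\frac{\lambda}{n}\Bigr)^{k} \Bigl(1 - \frac{\lambda}{n}\Bigr)^{n-k}
  = \underbrace{\frac{n(n-1)\cdots(n-k+1)}{n^{k}}}_{\to\, 1}
    \cdot \frac{\lambda^{k}}{k!}
    \cdot \underbrace{\Bigl(1 - \frac{\lambda}{n}\Bigr)^{n-k}}_{\to\, e^{-\lambda}}
  \longrightarrow e^{-\lambda} \frac{\lambda^{k}}{k!}.
```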
Chapter 3. Generating functions
The generating function of a random variable $X$ is $G_X(s) = \mathbb{E}[s^X] = \sum_{k \ge 0} s^k\, \mathbb{P}(X = k)$.
Using generating functions you can compute moments and the pmf of $X$: take the $k$-th derivative and evaluate at 0 (giving $\mathbb{P}(X = k) = G_X^{(k)}(0)/k!$) or at 1 (giving the factorial moments $G_X^{(k)}(1) = \mathbb{E}[X(X-1)\cdots(X-k+1)]$).
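A worked example (standard): the pgf of $X \sim \mathrm{Poisson}(\lambda)$, with the mean read off from the derivative at 1.

```latex
G_X(s) = \sum_{k=0}^{\infty} s^{k}\, e^{-\lambda} \frac{\lambda^{k}}{k!}
       = e^{-\lambda} e^{\lambda s}
       = e^{\lambda(s-1)},
\qquad
\mathbb{E}X = G_X'(1) = \lambda\, e^{\lambda(s-1)} \Big|_{s=1} = \lambda.
```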
Theorem 3.2 (Uniqueness theorem for probability generating functions). If X and Y have the same generating function, then they have the same distribution.
Generating functions for independent rvs: $G_{X+Y}(s) = G_X(s)\, G_Y(s)$.
Generating functions for random sums: if $S = X_1 + \cdots + X_N$ with the $X_i$ i.i.d. and independent of $N$, then $G_S(s) = G_N(G_X(s))$.
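Why the random-sum formula holds: condition on $N$ and use independence (one-line sketch).

```latex
G_S(s) = \mathbb{E}\bigl[s^{X_1 + \cdots + X_N}\bigr]
       = \sum_{n \ge 0} \mathbb{P}(N = n)\, \mathbb{E}\bigl[s^{X_1 + \cdots + X_n}\bigr]
       = \sum_{n \ge 0} \mathbb{P}(N = n)\, \bigl(G_X(s)\bigr)^{n}
       = G_N\bigl(G_X(s)\bigr).
```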
Probability generating functions are used for discrete random variables!
The moment generating function of a random variable $X$ is $M_X(t) = \mathbb{E}[e^{tX}]$.
The mgf is the pgf with $s = e^t$: $M_X(t) = G_X(e^t)$.
We use $e^{tX}$ because we can expand around $t = 0$. If we expanded $s^X$ around $s = 0$ we would no longer have a power series in the general case for cts variables.
Assuming the mgf exists (the expectation is finite), Taylor expansion: $M_X(t) = \sum_{k \ge 0} \frac{t^k}{k!}\, \mathbb{E}[X^k]$, so $\mathbb{E}[X^k] = M_X^{(k)}(0)$.
Mgfs and pgfs uniquely determine the distribution of a random variable.
For random variables $X, X_1, X_2, \dots$: if $M_{X_n}(t) \to M_X(t)$ for all $t$, then $X_n \to X$ in distribution as $n \to \infty$.
Fact. For a standard normal random variable $Z$, $M_Z(t) = e^{t^2/2}$.
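Sketch of the CLT proof via mgfs, assuming the mgf exists near 0 (this is where the continuity fact and the standard normal mgf above get used). Write $Z_i = (X_i - \mu)/\sigma$, so $\mathbb{E}Z_i = 0$ and $\mathbb{E}Z_i^2 = 1$:

```latex
M_{Z_1}(t) = 1 + \frac{t^{2}}{2} + o(t^{2}),
\qquad
M_{\frac{S_n - n\mu}{\sigma\sqrt{n}}}(t)
  = \bigl(M_{Z_1}(t/\sqrt{n})\bigr)^{n}
  = \Bigl(1 + \frac{t^{2}}{2n} + o(1/n)\Bigr)^{n}
  \longrightarrow e^{t^{2}/2} = M_Z(t).
```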
Characteristic functions
Replace $t$ in the mgf with $it$:
$\phi_X(t) = \mathbb{E}[e^{itX}]$.
Results for the mgf hold for the characteristic function too.
Mgf vs characteristic function
The characteristic function does not require exponentially decaying tails: since $|e^{itX}| = 1$, it always exists (at the price of working in the complex plane). The mgf proof of the CLT uses exponential decay of the tails (the mgf must be finite).
The mgf can be used to bound tails (using Markov's inequality, see Section 3.3: $\mathbb{P}(X \ge a) = \mathbb{P}(e^{tX} \ge e^{ta}) \le e^{-ta} M_X(t)$ for $t > 0$). This just does not make sense for the characteristic function, as we use complex numbers and cannot compare them.
Chapter 4. Joint distributions of continuous random variables
Chapter 5. Markov chains
Let $X_0, X_1, X_2, \dots$ be a sequence of random variables taking values in some state space $I$. The process is called a Markov chain if for any $n \ge 0$ and any $x_0, \dots, x_{n+1} \in I$,
$\mathbb{P}(X_{n+1} = x_{n+1} \mid X_0 = x_0, \dots, X_n = x_n) = \mathbb{P}(X_{n+1} = x_{n+1} \mid X_n = x_n)$.
The Markov chain is called (time) homogeneous if in addition $\mathbb{P}(X_{n+1} = j \mid X_n = i)$ does not depend on $n$. Then we write $p_{ij} = \mathbb{P}(X_{n+1} = j \mid X_n = i)$, and these are called the transition probabilities of the chain.
To describe a time homogeneous Markov chain you need the initial distribution of $X_0$ and a transition matrix $P = (p_{ij})$.
The matrix is indexed by the state space $I$. $P$ has non-negative entries and each row sums to 1.
Markov chains are memoryless.
We can say that “the future is independent of the past, given the present”.
Theorem 5.2. (Chapman-Kolmogorov equations)
- To reach from $i$ to $j$ in $n + m$ steps, try all middle states: $p_{ij}^{(n+m)} = \sum_{k \in I} p_{ik}^{(n)}\, p_{kj}^{(m)}$.
- To reach from $i$ to $j$ in $n$ steps: $p_{ij}^{(n)} = (P^n)_{ij}$.
Theorem 5.3. If $\lambda$ is the initial distribution (the distribution of $X_0$), then the distribution of $X_n$ is $\lambda P^n$.
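A minimal numerical sketch of Theorem 5.3 (the two-state chain here is a made-up example, not from the notes): the distribution of $X_n$ is the row vector $\lambda P^n$.

```python
import numpy as np

# Hypothetical 2-state chain: rows of P are non-negative and sum to 1
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
lam = np.array([1.0, 0.0])  # initial distribution: start in state 0

for n in [1, 2, 10, 50]:
    dist = lam @ np.linalg.matrix_power(P, n)  # distribution of X_n
    print(f"n={n:3d}  P(X_n = .) = {dist}")
```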
Class structure of state space of Markov chains.
We say state $i$ communicates with state $j$ if we can reach $j$ from $i$, that is $p_{ij}^{(n)}$ is positive for some $n$ (after a certain number of steps we can reach $j$ from $i$), and vice versa.
A class of states is closed if the probability to go out of the class is 0.
If a closed class consists of a single state $i$ (that is, $p_{ii} = 1$) then we call $i$ an absorbing state.
A chain is irreducible if all states communicate (you can reach everywhere from everywhere).
The period of a state $i$ is the gcd of the set $\{n \ge 1 : p_{ii}^{(n)} > 0\}$ of possible return times. If this set is empty (we can never go back, $p_{ii}^{(n)} = 0$ for all $n$) then the period is not defined.
$i$ is called aperiodic if this gcd is 1. Example: being able to go back in 1, 3, or 7 steps gives gcd 1.
Fact. All states in a communicating class have the same period. In particular, if a chain is irreducible, then all states have the same period.
Hitting probabilities
Define $h_i^A = \mathbb{P}(X_n \in A \text{ for some } n \ge 0 \mid X_0 = i)$, the hitting probability of the set $A$ starting from state $i$. If $A$ is a closed class we call it the absorption probability.
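The standard recipe for computing hitting probabilities is one-step analysis; the vector $h^A$ is the minimal non-negative solution of these equations:

```latex
h_i^{A} = 1 \quad \text{for } i \in A,
\qquad
h_i^{A} = \sum_{j \in I} p_{ij}\, h_j^{A} \quad \text{for } i \notin A.
```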
Recurrence and transience
The state $i$ is called transient if $\mathbb{P}_i(X_n = i \text{ for infinitely many } n) = 0$, equivalently $\mathbb{P}_i(\text{return to } i) < 1$.
The state $i$ is called recurrent if $\mathbb{P}_i(X_n = i \text{ for infinitely many } n) = 1$, equivalently $\mathbb{P}_i(\text{return to } i) = 1$.
Theorem 5.9. In a communicating class: either all states are recurrent or all are transient.
Every recurrent class is closed. Every finite closed class is recurrent.
The theorem tells us that recurrence and transience are quite boring for finite chains: a state is recurrent if and only if its communicating class is closed. But infinite chains are more interesting! An infinite closed class may be either transient or recurrent.
Theorem. State $i$ is recurrent iff $\sum_{n \ge 1} p_{ii}^{(n)} = \infty$.
Random walk in $\mathbb{Z}^d$
Consider a simple symmetric random walk on the $d$-dimensional integer lattice. This is a Markov chain with state space $\mathbb{Z}^d$ and transition probabilities $p_{xy} = \frac{1}{2d}$ if $|x - y| = 1$, and $p_{xy} = 0$ otherwise. The chain is irreducible, with period 2.
For $d \in \{1, 2\}$ the chain is recurrent (the probability of returning to 0 infinitely often is 1), and for $d \ge 3$ the chain is transient.
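A Monte Carlo sketch of this dichotomy (my own illustration; the horizon and walk counts are arbitrary, so the $d = 1$ estimate only approaches 1 as the horizon grows):

```python
import numpy as np

rng = np.random.default_rng(1)

def return_fraction(d, n_steps=1_000, n_walks=500):
    """Fraction of d-dimensional simple random walks returning to 0 within n_steps."""
    returned = 0
    for _ in range(n_walks):
        pos = np.zeros(d, dtype=int)
        for _ in range(n_steps):
            pos[rng.integers(d)] += rng.choice((-1, 1))  # one lattice step
            if not pos.any():  # back at the origin
                returned += 1
                break
    return returned / n_walks

print("d=1:", return_fraction(1))  # -> 1 as the horizon grows (recurrent)
print("d=3:", return_fraction(3))  # stays below ~0.34 (transient)
```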
Mean Return Time to a state
$m_i = \mathbb{E}_i[T_i]$, where $T_i = \min\{n \ge 1 : X_n = i\}$; that is, $m_i$ is the mean hitting time of $i$ starting from $i$.
If $i$ is transient then $m_i = \infty$ (the return time to $i$ itself is infinite with positive probability).
If $i$ is recurrent and $m_i = \infty$, we say $i$ is null recurrent.
If $i$ is recurrent and $m_i < \infty$, we say $i$ is positive recurrent.
If the chain is irreducible, we can therefore call the whole chain either transient, or null recurrent, or positive recurrent.
Chapter 6. Markov chains: stationary distributions and convergence to equilibrium
Let $\pi$ be a distribution on the state space $I$. We say that $\pi$ is a stationary distribution, or invariant distribution, or equilibrium distribution, for the transition matrix $P$ if $\pi P = \pi$.
$\pi$ is a left eigenvector of the matrix $P$ with eigenvalue 1.
That is, $\sum_{i} \pi_i p_{ij} = \pi_j$ for all $j$. Stationary distributions are those for which, after we move using the transition matrix, the distribution does not change.
Theorem 6.1 (Existence and uniqueness of stationary distributions). Let $P$ be an irreducible transition matrix.
- (a) $P$ has a stationary distribution if and only if $P$ is positive recurrent.
- (b) In that case, the stationary distribution $\pi$ is unique, and is given by $\pi_i = 1/m_i$ for all $i$ (where $m_i$ is the mean return time to state $i$ defined at (5.12)).
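A numerical sketch of Theorem 6.1 for a small hypothetical chain (not from the notes): find $\pi$ as the left eigenvector of $P$ with eigenvalue 1, normalised to sum to 1, and read off $m_i = 1/\pi_i$.

```python
import numpy as np

# Hypothetical irreducible 3-state chain
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# Left eigenvector of P = eigenvector of P.T, for eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()  # normalise to a probability distribution

print("pi   =", pi)
print("pi P =", pi @ P)   # stationarity: equals pi
print("m_i  =", 1 / pi)   # mean return times, by Theorem 6.1(b)
```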
Theorem 6.2 (Convergence to equilibrium). Suppose $P$ is irreducible and aperiodic, with stationary distribution $\pi$. For a Markov chain with transition matrix $P$ and any initial distribution, $\mathbb{P}(X_n = i) \to \pi_i$ as $n \to \infty$ for all $i$.
Theorem 6.3 (Ergodic theorem). Let $P$ be irreducible. Let $V_i(n)$ be the number of visits to state $i$ before time $n$.
Then $\frac{V_i(n)}{n} \to \frac{1}{m_i}$ almost surely as $n \to \infty$.
The ergodic theorem concerns the “long-run proportion of time” spent in a state.
In the positive recurrent case, the theorem says the long-run proportion of time spent in a state is the stationary probability $\pi_i = 1/m_i$ of that state.
In the null-recurrent or transient case, $m_i = \infty$ so $1/m_i = 0$, and the ergodic theorem says that with probability 1 the long-run proportion of time spent in a state is 0. We can see the ergodic theorem as a generalisation of the strong law of large numbers: the SLLN is the i.i.d. case, and the ergodic theorem extends this to the case where $X_n$ is not i.i.d. but is a Markov chain. I.i.d. is a stronger assumption than the Markov property.
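Simulating a chain confirms the ergodic theorem numerically; a sketch reusing the same hypothetical 3-state chain as above:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])  # same hypothetical chain as above

rng = np.random.default_rng(2)
x, n_steps = 0, 100_000
visits = np.zeros(3)
for _ in range(n_steps):
    visits[x] += 1
    x = rng.choice(3, p=P[x])          # one step of the chain
print("V_i(n)/n =", visits / n_steps)  # converges a.s. to 1/m_i = pi_i
```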
Intuition about the convergence theorems.
The idea will be that after a long time, a Markov chain should more or less “forget where it started”. There are essentially two reasons why this might not happen:
- (a) periodicity; for example, if a chain has period 2, then it alternates between, say, "odd" and "even" states; even after an arbitrarily long time, the chain will still remember whether it started at an "odd" or "even" state.
- (b) lack of irreducibility. A chain with more than one closed class can never move from one to the other, and so again will retain some memory of where it started, forever.
Thus for convergence to equilibrium (which does not depend on the initial distribution) we require aperiodicity and irreducibility.
Chapter 7. Poisson processes
A Poisson process is a natural model for a stream of events occurring one by one in continuous time, in an uncoordinated way.
Example:
- the process of times of detections by a Geiger counter near a radioactive source (a very accurate model)
- the process of times of arrivals of calls at a call centre (often a good model)
- the process of times of arrivals of buses at a bus stop (probably an inaccurate model; different buses are not really uncoordinated, for various reasons).
Definition. A counting process is a random process $(N_t)_{t \ge 0}$, where time $t$ is continuous and $N_t$ is a random variable which takes values in $\{0, 1, 2, \dots\}$, with $N_s \le N_t$ for $s \le t$.
Arrival process
If $N$ describes an arrival process, then $N_t = k$ means that there have been $k$ arrivals in the time interval $(0, t]$.
In fact we can describe the process by the sequence of arrival times, which we might call "points" of the process. Let $T_k = \inf\{t \ge 0 : N_t \ge k\}$ for $k \ge 1$ (inf = infimum!).
$T_k$ is the $k$-th arrival time.
Also $T_0 = 0$, and $Y_k = T_k - T_{k-1}$ is the interarrival time between the $(k-1)$-th and $k$-th arrivals.
For $s < t$, we write $N_{(s,t]}$ for $N_t - N_s$, and we call this the increment of the process on the interval $(s, t]$ = the number of arrivals between times $s$ and $t$.
Definition 7.1 (Definition of Poisson process via exponential interarrival times). $(N_t)$ is a Poisson process of rate $\lambda$ if its interarrival times $Y_1, Y_2, \dots$ are i.i.d. with $\mathrm{Exp}(\lambda)$ distribution.
Definition 7.2 (Definition of Poisson process via Poisson distribution of increments). $(N_t)$ is a Poisson process of rate $\lambda$ if:
- $N_0 = 0$
- If $(s_1, t_1], \dots, (s_k, t_k]$ are disjoint intervals, then the increments $N_{(s_1, t_1]}, \dots, N_{(s_k, t_k]}$ are independent. The numbers of points falling in disjoint intervals are independent.
- For any $s < t$, the increment $N_{(s,t]}$ has $\mathrm{Poisson}(\lambda (t - s))$ distribution.
The two definitions are equivalent. The key idea is that the memoryless property for the exponential distribution and the independent increments property are telling us the same thing.
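A simulation sketch of the equivalence (the rate, horizon, and run count are arbitrary choices of mine): generate the process from Definition 7.1 and check the conclusion of Definition 7.2, i.e. that $N_t$ looks $\mathrm{Poisson}(\lambda t)$.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t, n_runs = 2.0, 5.0, 20_000

counts = np.empty(n_runs)
for i in range(n_runs):
    # Definition 7.1: arrival times = cumulative sums of Exp(lam) interarrivals
    # (50 interarrivals comfortably overshoot the horizon t = 5 here)
    arrivals = np.cumsum(rng.exponential(scale=1/lam, size=50))
    counts[i] = np.sum(arrivals <= t)  # N_t

# Definition 7.2 predicts N_t ~ Poisson(lam*t), so mean = variance = lam*t
print("mean N_t =", counts.mean(), " var N_t =", counts.var(), " lam*t =", lam * t)
```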
To get more intuition for the relation between Poisson increments and exponential interarrivals, one can also think about a related discrete-time process.
- (1) If $X_n \sim \mathrm{Geometric}(\lambda/n)$, then $X_n / n \to \mathrm{Exp}(\lambda)$ in distribution as $n \to \infty$. (See Example 2.9.)
- (2) If $Z_n \sim \mathrm{Binomial}(n, \lambda/n)$, then $Z_n \to \mathrm{Poisson}(\lambda)$ in distribution as $n \to \infty$. (See Example 2.3.)
So we can see this exponential/Poisson relationship in the Poisson process as a limit of the geometric/binomial relationship which is already familiar from sequences of independent trials.
Theorem 7.1 (Superposition of Poisson processes). Let $L_t$ and $M_t$ be independent Poisson processes of rate $\lambda$ and $\mu$ respectively. Let $N_t = L_t + M_t$. Then $N_t$ is a Poisson process of rate $\lambda + \mu$.
Theorem 7.2 (Thinning of a Poisson process). Let $N_t$ be a Poisson process of rate $\lambda$. "Mark" each point of the process with probability $p$, independently for different points. Let $M_t$ be the counting process of the marked points. Then $M_t$ is a Poisson process of rate $\lambda p$.
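Thinning ties back to the random sums of Chapter 3: for fixed $t$, $M_t$ is a sum of $N_t$ i.i.d. $\mathrm{Bernoulli}(p)$ marks, so (a sketch of the one-dimensional marginal only):

```latex
G_{M_t}(s) = G_{N_t}\bigl(G_{\mathrm{Bern}(p)}(s)\bigr)
           = e^{\lambda t\,\bigl((1 - p + ps) - 1\bigr)}
           = e^{\lambda p t\,(s - 1)},
```

which is the pgf of $\mathrm{Poisson}(\lambda p t)$; the full theorem also needs independence of increments.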
Problem sheets
Problem sheet 3.
Q1. Closed classes (1-indexed); closed classes are always communicating classes.
For the second part, need Chapman-Kolmogorov: raise the transition matrix to a power.
Q2.
Q3. States 6 and not-6. Need the eigenvalue decomposition and raise the transition matrix to the power $n$. Or condition on the last state.
Q5. basic logic
Q6.
- a) write transition matrix
- b) write all equations etc.; the answer is 2. Alternatively, consider state one and state two. Currently we are in state two: probability 0.5 to go from state two to one and 0.5 to stay in the same state. By the geometric distribution you get expectation 2.
- c) write all equations (conditioning) etc.
- d) use c) + Bayes' rule; for the later part, need to compute the probability of winning given we reach 10, starting from 2, 4, 6.
Q7.
- a) and condition on the first step
- b)
- c) the maximal solution (look at theorem)
Q9. uniform distribution by symmetry