Probability

Oxford notes for Part A probability.

Buzzwords

Probability space (\Omega, F, P) = modelling an experiment. \Omega is the space of outcomes, F is the space of events, P gives a measure of probability.

Memoryless property of geometric and exponential distributions = what's past is past (the past does not matter).

close connections between the Poisson distribution and the exponential distribution - think Poisson processes.

X_n \rightarrow X almost surely (with probability 1) implies convergence in probability implies convergence in distribution.

WLLN

Weak law of large numbers is a limit in probability. Proven using Chebyshev's inequality (requires finite variance).

This assumption of finite variance is not necessary (it just makes the proof easier). Large or infinite variance will make the convergence slower, but the LLN holds anyway.

Strong Law of Large Numbers (SLLN) is almost sure convergence.

CLT, converges in distribution. The fluctuations of the sample mean S_n/n around \mu are of order 1/\sqrt{n}

risk pooling (used in CLT): S_n = X_1 + ... + X_n, where X_1, ..., X_n are different portfolios (iid assumption!).

Binomial with small probability and many trials is approximately Poisson

Binomial = sum of Bernoullis. If p is fixed, then by the CLT the Binomial converges to a Normal distribution. Consider W_n \sim Bin(n, p_n), where p_n converges to 0. Assume the expected total number of successes stays approximately the same: np_n = \lambda. Then the Binomial converges to Poisson.

generating functions are used to compute moments and pmfs. Take the kth derivative and evaluate at 0 or 1.

Pgf-s are used for discrete random variables: E(z^{X}).

Mgf-s are used for continuous random variables: E(e^{tX}).

Mgfs and Pgfs uniquely determine the distribution of a random variable.

If M_{X_n}(t) \rightarrow M_{X}(t) (for all t in a neighbourhood of 0), then X_{n} \rightarrow X in distribution as n \rightarrow \infty.

Proof of WLLN and CLT using mgfs and the above fact

Chebyshev's inequality is the application of Markov's inequality to the random variable (X - \mu)^2.

Markov's inequality bounds the tails of non-negative distributions: P(X > a) \leq \dfrac{E(X)}{a}
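A quick numerical check of this bound (a Python sketch; the Exp(1) example is my own choice): for X \sim Exp(1) the tail P(X > a) = e^{-a} is known exactly, so it can be compared directly with the Markov bound E(X)/a = 1/a.

```python
import math

# Markov's inequality: P(X > a) <= E(X)/a for non-negative X and a > 0.
# Exact check for X ~ Exp(1): P(X > a) = e^{-a} and E(X) = 1.
for a in [1.0, 2.0, 5.0]:
    tail = math.exp(-a)   # exact tail probability
    bound = 1.0 / a       # Markov bound E(X)/a
    assert tail <= bound
```

The bound is crude (at a = 5 the true tail is about 0.0067 against a bound of 0.2), which is exactly why the Chebyshev refinement via (X - \mu)^2 is useful.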

Fact: for a standard normal random variable, P(X > 3) \approx 10^{-3}

Markov chain is a sequence of variables satisfying the Markov property:

P(X_{n+1} = i_{n+1} | X_n = i_n, ..., X_0 = i_0) = P(X_{n+1} = i_{n+1} | X_n = i_n).

To describe a time-homogeneous Markov chain you need the initial distribution of X_0 and a transition matrix P = (p_{ij})_{i,j \in I}

The matrix P is indexed by the state space I. P has non-negative entries and each row sums to 1.

Markov chains are memoryless.

Chapman-Kolmogorov equations

irreducible chains, communicating classes, closed class, periodicity of classes

recurrence, transience, mean return time, null recurrence, positive recurrence

null recurrent = mean return time is infinite but the probability of going back infinitely many times is 1.

in a class all states are either positive recurrent, null recurrent, or transient

Random walk on \Z^{d} is an irreducible chain with period 2.

iid is a special case of a Markov chain

Ergodic theorem is a generalisation of SLLN

Chapter 1. Recap Prelims

  • (i) \Omega is a set, which we call the sample space.
  • (ii) F is a collection of subsets of \Omega. An element of F is called an event.
  • (iii) The probability measure P is a function from F to [0, 1]. It assigns a probability to each event in F.

A random variable is a function defined on \Omega. Maps outcomes to numbers.

cdf, pdf, discrete and cts random variables, variance, covariance

independence: P(A \cap B \cap C) = P(A)P(B)P(C) (together with the pairwise conditions). Pairwise independence is weaker than independence.

If X \sim Gamma(r_X, \lambda) and Y \sim Gamma(r_Y, \lambda) are independent, then X + Y \sim Gamma(r_X + r_Y, \lambda). As a special case, if X_1, X_2, ..., X_n are i.i.d. with Exp(\lambda) distribution, then X_1 + X_2 + ... + X_n has Gamma(n, \lambda) distribution.
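A small Monte Carlo sanity check of the special case (Python sketch; the parameters n = 5, \lambda = 2 are my own): a sum of n i.i.d. Exp(\lambda) variables should show the Gamma(n, \lambda) mean n/\lambda and variance n/\lambda^2.

```python
import random

random.seed(0)

# Sum of n i.i.d. Exp(lam) should be Gamma(n, lam):
# mean n/lam = 2.5, variance n/lam^2 = 1.25.
n, lam, trials = 5, 2.0, 20000
sums = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(trials)]
mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
assert abs(mean - n / lam) < 0.05
assert abs(var - n / lam ** 2) < 0.1
```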

Memoryless property of geometric and exponential distributions.

If X \sim Poisson(\lambda) and Y \sim Poisson(\mu) are independent, then X + Y \sim Poisson(\lambda + \mu).

Chapter 2. Convergence of random variables and limit theorems.

We need to formalise the notion that two random variables XX and YY are close.

Modes of convergence

Let X_1, ..., X_n, X be random variables. We say:

X_n converges to X almost surely (with probability 1) if P(X_n \rightarrow X \text{ as } n \rightarrow \infty) = 1

X_n converges in probability to X if for every \epsilon > 0: P(|X_n - X| < \epsilon) \rightarrow 1 \text{ as } n \rightarrow \infty.

X_n converges to X in distribution (weakly) if F_n(x) \rightarrow F(x) as n \rightarrow \infty for all x where F is continuous.

These formulations are in decreasing order of strength. The last one does not even require the random variables to be defined on the same probability space! It is really a definition about convergence of distributions, not about convergence of random variables.

There are many situations in which a sequence of discrete random variables converges to a continuous limit.

WLLN is convergence in probability.

S_n = X_1 + ... + X_n, where X_1, ..., X_n are iid with finite variance. Then:

P(|\dfrac{S_n}{n} - \mu| < \epsilon) \rightarrow 1 as n \rightarrow \infty. Prove it using Chebyshev's inequality.
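The convergence can be watched in a simulation (Python sketch; fair-coin summands with \mu = 0.5 and \epsilon = 0.05 are my own choices): the probability that the sample mean lands within \epsilon of \mu climbs towards 1 as n grows.

```python
import random

random.seed(1)

# WLLN: P(|S_n/n - mu| < eps) -> 1.  Estimate this probability for
# fair-coin flips (mu = 0.5) at a small and a large n.
def prob_close(n, eps=0.05, trials=2000):
    hits = 0
    for _ in range(trials):
        mean = sum(random.random() < 0.5 for _ in range(n)) / n
        hits += abs(mean - 0.5) < eps
    return hits / trials

p_small, p_large = prob_close(20), prob_close(500)
assert p_small < p_large   # closeness probability increases with n
assert p_large > 0.95
```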

CLT - convergence in distribution (standard normal). The fluctuations of S_n = X_1 + ... + X_n around n\mu are of order \sqrt{n}.
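The \sqrt{n} scale can also be checked by simulation (Python sketch; Uniform(0, 1) summands are my own choice): the standard deviation of S_n/n should roughly halve when n is multiplied by 4.

```python
import random

random.seed(6)

# CLT scale check: sd of S_n/n shrinks like 1/sqrt(n), so quadrupling n
# should roughly halve it (ratio close to 2).
def sd_of_mean(n, trials=3000):
    means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]
    mu = sum(means) / trials
    return (sum((m - mu) ** 2 for m in means) / trials) ** 0.5

ratio = sd_of_mean(100) / sd_of_mean(400)
assert 1.7 < ratio < 2.3
```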

Binomial with small probability and many trials is approximately Poisson

Binomial = sum of Bernoullis. If p is fixed, then by the CLT the Binomial converges to a Normal distribution. Consider W_n \sim Bin(n, p_n), where p_n converges to 0. Assume the expected total number of successes stays approximately the same: np_n = \lambda. Then the Binomial converges to Poisson.
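The Poisson approximation can be checked exactly from the two pmfs (Python sketch; n = 1000 and \lambda = 2 are my own choices):

```python
import math

# Bin(n, lam/n) pmf vs Poisson(lam) pmf: as n grows they agree.
def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0
err = max(abs(binom_pmf(1000, lam / 1000, k) - poisson_pmf(lam, k))
          for k in range(20))
assert err < 1e-3   # pointwise difference is already tiny at n = 1000
```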

Chapter 3. Generating functions

The generating function of a random variable X is G(z) = E(z^{X})

Using generating functions you can compute moments and the pmf of X. Take the kth derivative and evaluate at 0 or 1.

Theorem 3.2 (Uniqueness theorem for probability generating functions). If X and Y have the same generating function, then they have the same distribution.

Generating functions for independent rvs: G_{X+Y}(z) = G_{X}(z)G_{Y}(z)

Generating functions for random sums S = X_1 + ... + X_N: G_{S}(z) = G_{N}(G_{X}(z))
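A worked instance of the composition rule (Python sketch; the Poisson/Bernoulli choice is mine): with N \sim Poisson(\lambda) and X_i \sim Bernoulli(p), G_N(G_X(z)) collapses to \exp(\lambda p(z - 1)), the pgf of Poisson(\lambda p) - the thinning property.

```python
import math

# Random sum S = X_1 + ... + X_N via pgfs: G_S(z) = G_N(G_X(z)).
# N ~ Poisson(lam): G_N(s) = exp(lam*(s-1));  X ~ Bernoulli(p): G_X(z) = 1-p+p*z.
lam, p = 3.0, 0.4
G_N = lambda s: math.exp(lam * (s - 1))
G_X = lambda z: 1 - p + p * z
G_S = lambda z: G_N(G_X(z))

# The composition equals the pgf of Poisson(lam*p) at every z.
for z in [0.0, 0.3, 0.7, 1.0]:
    assert abs(G_S(z) - math.exp(lam * p * (z - 1))) < 1e-12
```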

Probability generating functions are used for discrete random variables!

Moment generating function of a random variable is M_X(t) = E(e^{tX})

Mgf is the pgf with z = e^t

We use e^{t} because we can expand around t = 0. If we expanded around z = 0 we would no longer have a power series in the general case for cts variables.

Assuming the mgf exists (the expectation is finite), Taylor expansion:

M_{X}(t) = \sum_{k=0}^{\infty} \dfrac{t^{k}E(X^k)}{k!}

M_{X}^{(k)}(0) = E(X^{k})
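A numerical check of the moment formula (Python sketch; X \sim Exp(2) is my own choice, with M(t) = \lambda/(\lambda - t), E(X) = 1/\lambda, E(X^2) = 2/\lambda^2): finite differences at t = 0 recover the first two moments.

```python
# Moments from the mgf by numerical differentiation at t = 0.
lam = 2.0
M = lambda t: lam / (lam - t)          # mgf of Exp(lam), valid for t < lam
h = 1e-5
m1 = (M(h) - M(-h)) / (2 * h)          # central difference ~ M'(0) = 1/lam
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # ~ M''(0) = 2/lam^2
assert abs(m1 - 1 / lam) < 1e-6
assert abs(m2 - 2 / lam**2) < 1e-4
```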

Mgfs and Pgfs uniquely determine the distribution of a random variable.

If M_{X_n}(t) \rightarrow M_{X}(t) (for all t in a neighbourhood of 0), then X_{n} \rightarrow X in distribution as n \rightarrow \infty.

Fact: for a standard normal random variable, P(X > 3) \approx 10^{-3}.

Characteristic functions

Replace t in the mgf with it:

\phi_{X}(t) = E(e^{itX}) = E(\cos(tX)) + iE(\sin(tX)).

Results for mgfs hold for characteristic functions too.

Mgf vs characteristic function

The characteristic function always exists, because |e^{itX}| = 1: it does not require exponentially decaying tails. Using the mgf to prove the CLT requires exponential decay (finiteness of the expectation E(e^{tX})).

The mgf can be used to bound tails (using Markov's inequality, see section 3.3). The characteristic function cannot, since complex numbers cannot be compared.

Chapter 4. Joint distributions of continuous random variables

Chapter 5. Markov chains

Let X = (X_0, X_1, X_2, ...) be a sequence of random variables taking values in some state space I. The process X is called a Markov chain if for any n \geq 0 and any i_0, i_1, ..., i_{n+1} \in I,

P(X_{n+1} = i_{n+1} | X_n = i_n, ..., X_0 = i_0) = P(X_{n+1} = i_{n+1} | X_n = i_n).

The Markov chain is called (time) homogeneous if in addition P(X_{n+1} = j | X_{n} = i) does not depend on n. Then p_{ij} = P(X_{n+1} = j | X_{n} = i) and these are called the transition probabilities of the chain.

To describe a time-homogeneous Markov chain you need the initial distribution of X_0 and a transition matrix P = (p_{ij})_{i,j \in I}

The matrix P is indexed by the state space I. P has non-negative entries and each row sums to 1.

Markov chains are memoryless.

We can say that “the future is independent of the past, given the present”.

Theorem 5.2. (Chapman-Kolmogorov equations)

  • To get from i to k in m+n steps, condition on the midpoint j: p_{ik}^{(m+n)} = \sum_{j} p_{ij}^{(m)} p_{jk}^{(n)}
  • To get from i to j in n steps: p_{ij}^{(n)} = (P^n)_{i,j}

Theorem 5.3. If \lambda is the initial distribution, then the distribution of X_n is \lambda P^{n}.
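Theorem 5.3 in action (Python sketch; the two-state chain with a = 0.3, b = 0.1 is my own example): repeatedly multiplying the row vector by P computes \lambda P^n, and for this chain it approaches the stationary distribution (b/(a+b), a/(a+b)) = (0.25, 0.75).

```python
# Distribution after n steps is lam * P^n (row vector times matrix power).
a, b = 0.3, 0.1
P = [[1 - a, a], [b, 1 - b]]

def step(dist, P):
    """One step: multiply a row distribution by the transition matrix."""
    return [sum(dist[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

dist = [1.0, 0.0]       # initial distribution lam: start in state 0
for _ in range(200):    # dist is now lam * P^200
    dist = step(dist, P)
assert abs(dist[0] - b / (a + b)) < 1e-9
assert abs(dist[1] - a / (a + b)) < 1e-9
```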

Class structure of state space of Markov chains.

We say state i communicates with state j if we can get from i to j, that is p_{ij}^{(n)} is positive for some n (after a certain number of steps we can reach j from i), and vice versa.

A class of states C is closed if the probability of leaving the class is 0.

If the class \{i\} is closed then we call i an absorbing state.

The chain is irreducible if all states communicate (you can get from everywhere to everywhere).

The period of a state i is the gcd of the possible return times, that is the gcd of the set \{n \geq 1 : p_{ii}^{(n)} > 0\}; if this set is empty (you can never return, p_{ii}^{(n)} = 0 for all n) then the period is not defined.

i is called aperiodic if this g.c.d. is 1. Example: returns possible after 1, 3, 7 steps.

Fact. All states in a communicating class have the same period. In particular, if a chain is irreducible, then all states have the same period.

Hitting probabilities

Define h_{i}^{A} = P_{i}(X_{n} \in A \text{ for some } n \geq 0), the hitting probability of A starting from state i. If A is a closed class we call this the absorption probability.

Recurrence and transience

P_{i}(X_{n} = i \text{ for some } n \geq 1) = p < 1, equivalently P_{i}(\text{hit } i \text{ infinitely often}) = 0. The state i is called transient.

P_{i}(X_{n} = i \text{ for some } n \geq 1) = 1, equivalently P_{i}(\text{hit } i \text{ infinitely often}) = 1. The state i is called recurrent.

Theorem 5.9. In a communicating class: either all states are recurrent or all are transient.

Every recurrent class is closed. Every finite closed class is recurrent.

The theorem tells us that recurrence and transience are quite boring for finite chains: state ii is recurrent if and only if its communicating class is closed. But infinite chains are more interesting! An infinite closed class may be either transient or recurrent.

Theorem. State i is recurrent iff \sum_{n=0}^{\infty} p_{ii}^{(n)} = \infty

Random walk in \Z^{d}

Consider a simple symmetric random walk on the d-dimensional integer lattice. This is a Markov chain with state space \Z^{d} and transition probabilities p_{xy} = 1/(2d) if |x - y| = 1, and p_{xy} = 0 otherwise. The chain is irreducible, with period 2.

For d = 1, 2 the chain is recurrent (the probability of returning to 0 infinitely often is 1), and for d \geq 3 the chain is transient.
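A simulation sketch of the contrast (Python; the horizon T = 200 and walk count are my own choices, so these are finite-time estimates, not the true return probabilities): the fraction of walks returning to the origin is near 1 for d = 1 and markedly smaller for d = 3.

```python
import random

random.seed(2)

# Fraction of simple symmetric walks on Z^d returning to the origin
# within T steps: high for d = 1 (recurrent), lower for d = 3 (transient,
# eventual return probability ~ 0.34).
def return_fraction(d, T=200, walks=1000):
    returned = 0
    for _ in range(walks):
        pos = [0] * d
        for _ in range(T):
            axis = random.randrange(d)
            pos[axis] += random.choice((-1, 1))
            if all(x == 0 for x in pos):
                returned += 1
                break
    return returned / walks

f1, f3 = return_fraction(1), return_fraction(3)
assert f1 > 0.8 > f3
```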

Mean Return Time to a state

m_{i} = E(\text{time to return to } i \text{ starting from } i) = 1 + \sum_{j} p_{ij} k_{j}^{i}

where kjik_{j}^{i} is the mean hitting time of ii starting from jj

If i is transient then m_i = \infty (the return time is infinite with positive probability).

If i is recurrent and m_i = \infty, we say i is null recurrent.

If i is recurrent and m_i < \infty, we say i is positive recurrent.

If the chain is irreducible, we can therefore call the whole chain either transient, or null recurrent, or positive recurrent.

Chapter 6. Markov chains: stationary distributions and convergence to equilibrium

Let \pi = (\pi_{i}, i \in I) be a distribution on the state space I. We say that \pi is a stationary distribution, or invariant distribution, or equilibrium distribution, for the transition matrix P if \pi P = \pi

\pi is a left eigenvector of the matrix P with eigenvalue 1.

That is, for all j: \pi_{j} = \sum_{i} \pi_{i} p_{ij}. Stationary distributions are those for which the distribution does not change after a step of the chain.

Theorem 6.1 (Existence and uniqueness of stationary distributions). Let P be an irreducible transition matrix.

  • (a) P has a stationary distribution if and only if P is positive recurrent.
  • (b) In that case, the stationary distribution \pi is unique, and is given by \pi_{i} = 1/m_{i} for all i (where m_{i} is the mean return time to state i defined at (5.12)).

Theorem 6.2 (Convergence to equilibrium). Suppose P is irreducible and aperiodic, with stationary distribution \pi. For a Markov chain with transition matrix P and any initial distribution, P(X_n = j) \rightarrow \pi_{j} as n \rightarrow \infty for all j.

Theorem 6.3 (Ergodic theorem). Let P be irreducible. Let V_i(n) be the number of visits to state i before time n.

Then \dfrac{V_i(n)}{n} \rightarrow \dfrac{1}{m_i} almost surely as n \rightarrow \infty.

The ergodic theorem concerns the “long-run proportion of time” spent in a state.

In the positive recurrent case, the theorem says the long-run proportion of time \dfrac{V_i(n)}{n} spent in state i is the stationary probability of that state, \dfrac{1}{m_i} = \pi_{i}.

In the null-recurrent or transient case, 1/m_{i} = 0, so the ergodic theorem says that with probability 1 the long-run proportion of time spent in a state is 0. We can see the ergodic theorem as a generalisation of the strong law of large numbers, extending it to the case where X_n is not i.i.d. but a Markov chain. IID is a stronger assumption than the Markov property.
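The ergodic theorem can be watched numerically (Python sketch; the two-state chain with a = 0.3, b = 0.1 is my own example, for which \pi = (0.25, 0.75)): the long-run fraction of time in state 0 approaches \pi_0 = 0.25.

```python
import random

random.seed(3)

# Ergodic theorem check: long-run fraction of visits to a state equals
# its stationary probability pi_i = 1/m_i.
a, b = 0.3, 0.1           # pi = (b/(a+b), a/(a+b)) = (0.25, 0.75)
n, visits0, state = 200000, 0, 0
for _ in range(n):
    visits0 += (state == 0)
    if state == 0:
        state = 1 if random.random() < a else 0
    else:
        state = 0 if random.random() < b else 1
assert abs(visits0 / n - 0.25) < 0.01
```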

Intuition about the convergence theorems.

The idea will be that after a long time, a Markov chain should more or less “forget where it started”. There are essentially two reasons why this might not happen:

  • (a) periodicity; for example if a chain has period 2, then it alternates between, say, “odd” and “even” states; even after an arbitrarily long time, the chain will still remember whether it started at an “odd” or “even” state.

  • (b) lack of irreducibility. A chain with more than one closed class can never move from one to the other, and so again will retain some memory of where it started, forever.

Thus for convergence to equilibrium (which does not depend on the initial distribution) we require aperiodicity and irreducibility.

Poisson processes

A Poisson process is a natural model for a stream of events occurring one by one in continuous time, in an uncoordinated way.

Example:

  • the process of times of detections by a Geiger counter near a radioactive source (a very accurate model)
  • the process of times of arrivals of calls at a call centre (often a good model)
  • the process of times of arrivals of buses at a bus stop (probably an inaccurate model; different buses are not really uncoordinated, for various reasons)

Definition. A counting process is a random process N_t for t \in [0, \infty), where time t is continuous and N_t is a random variable taking values in \{0, 1, 2, ...\}, with N_s \leq N_t for s < t.

Arrival process

If N_t describes an arrival process, then N_t = k means that there have been k arrivals in the time interval [0, t].

In fact we can describe the process by the sequence of arrival times, which we might call the “points” of the process. Let T_k = \inf\{t \geq 0 : N_t \geq k\} for k \geq 0. (inf = infimum!)

T_0 = 0, and T_k is the k-th arrival time.

Also Y_k = T_k - T_{k-1} for k \geq 1; Y_k is the interarrival time between arrivals k-1 and k.

For s < t, we write N(s, t] for N_t - N_s and we call this the increment of N on the interval (s, t] = the number of arrivals between s and t.

Definition 7.1 (Definition of Poisson process via exponential interarrival times). (N_t, t \geq 0) is a Poisson process of rate \lambda if its interarrival times Y_1, Y_2, Y_3, ... are i.i.d. with Exp(\lambda) distribution.

Definition 7.2 (Definition of Poisson process via Poisson distribution of increments). (N_t, t \geq 0) is a Poisson process of rate \lambda if:

  • N_0 = 0
  • If (s_1, t_1], ..., (s_k, t_k] are disjoint intervals in \R_{+}, then the increments N(s_i, t_i] are independent. The numbers of points falling in disjoint intervals are independent.
  • For any s < t, the increment N(s, t] has Poisson distribution with parameter \lambda(t - s)

The two definitions are equivalent. The key idea is that the memoryless property for the exponential distribution and the independent increments property are telling us the same thing.
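The equivalence can be checked by simulation (Python sketch; rate \lambda = 3 and t = 2 are my own choices): build the process from i.i.d. Exp(\lambda) interarrivals as in Definition 7.1, then check that the increment N(0, t] shows the Poisson(\lambda t) signature of Definition 7.2, namely mean and variance both equal to \lambda t.

```python
import random

random.seed(4)

# Poisson process from exponential interarrivals: count arrivals in [0, t].
lam, t, trials = 3.0, 2.0, 5000
counts = []
for _ in range(trials):
    s, k = 0.0, 0
    while True:
        s += random.expovariate(lam)   # next interarrival time
        if s > t:
            break
        k += 1
    counts.append(k)
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
assert abs(mean - lam * t) < 0.15   # Poisson mean lam*t = 6
assert abs(var - lam * t) < 0.5     # Poisson variance also 6
```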

To get more intuition for the relation between Poisson increments and exponential interarrivals, one can also think about a related discrete-time process.

  • (1) If X_n \sim Binomial(n, \lambda/n), then X_n \rightarrow Poisson(\lambda) as n \rightarrow \infty. (See Example 2.9.)
  • (2) If Y_n \sim Geometric(\lambda/n), then Y_n/n \rightarrow Exp(\lambda) as n \rightarrow \infty. (See Example 2.3.)

So we can see the exponential/Poisson relationship in the Poisson process as a limit of the geometric/binomial relationship which is already familiar from sequences of independent trials.

Theorem 7.1 (Superposition of Poisson processes). Let L_t and M_t be independent Poisson processes of rates \lambda and \mu respectively. Let N_t = L_t + M_t. Then N_t is a Poisson process of rate \lambda + \mu.

Theorem 7.2 (Thinning of a Poisson process). Let N_t be a Poisson process of rate \lambda. “Mark” each point of the process with probability p, independently for different points. Let M_t be the counting process of the marked points. Then M_t is a Poisson process of rate p\lambda.
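A simulation sketch of thinning (Python; \lambda = 4, p = 1/4 are my own choices): marking each point with probability p should leave a process whose counts on [0, t] have mean p\lambda t.

```python
import random

random.seed(5)

# Thinning: keep each point of a rate-lam Poisson process with prob p;
# the marked process should have rate p*lam (here p*lam = 1).
lam, p, t, trials = 4.0, 0.25, 1.0, 5000
marked_counts = []
for _ in range(trials):
    s, marked = 0.0, 0
    while True:
        s += random.expovariate(lam)
        if s > t:
            break
        if random.random() < p:   # mark this point independently
            marked += 1
    marked_counts.append(marked)
mean = sum(marked_counts) / trials
assert abs(mean - p * lam * t) < 0.08
```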

Problem sheets

Problem sheet 3.

Q1. closed classes: \{3\}, \{1, 5\}, \{1, 2, 5\} (1-indexed). Closed classes are always communicating classes.

(1/5, 1/10, 7/20, 1/5, 3/20); for the second part need Chapman-Kolmogorov: raise to P^2, answer 9/16.

Q2. p_{ii} = 1 - (\dfrac{N-i}{N})^2 - (\dfrac{i}{N})^2, \quad p_{i,i-1} = (\dfrac{i}{N})^2, \quad p_{i,i+1} = (\dfrac{N-i}{N})^2

Q3. states: 6 and not-6. Need the eigenvalue decomposition and raise the transition matrix to the power n. Or condition on the last step: p_n = p_{n-1} \dfrac{4}{5} \dfrac{1}{5}

P(\text{return to } 1 \text{ after } n \text{ steps}) = \dfrac{1}{5}(1 - P(\text{return to } 6))

Q5. basic logic

Q6.

  • a) write transition matrix
  • b) write all the equations e_8 = e_{10}/2 + e_{6}/2 + 1 etc.; the answer is 2. Alternatively, lump into state one \{0, 10\} and state two \{2, 4, 6, 8\}. Currently we are in state two. There is probability 0.5 to go from state two to state one and 0.5 to stay in the same state. By the geometric distribution the expectation is 2.
  • c) write all the equations (conditioning): p_8 = p_{10}/2 + p_{6}/2 etc.
  • d) use c) + Bayes' rule; for the later part need to compute the probability of winning given we reach 10, starting from 2, 4, 6.

Q7.

  • a) p_{i} + q_{i} = 1 and condition on the first step
  • b) h_{i} = 1 - u_{1}\sum_{k=1}^{i}\gamma_{k}
  • c) the maximal solution (look at theorem)

Q9. uniform distribution by symmetry