Spring 2009:
To be thinking about:
Ideas for paper(s) to read week of June 29 after finishing Chapter 4. Can be theoretical paper going more in depth on topics covered in Chapter 4 (e.g. read paper on generalized BP), theoretical paper extending Chapter 4 (e.g. Sudderth paper on Bethe approximation as lower bound in certain attractive graphical models), or application paper (e.g. vision paper applying generalized BP).
Ideas for topics to cover in fall semester. Some potential ideas: MCMC methods for inference, learning theory.
The reparameterization 4.27 in Proposition 4.3 is only possible because of the overcomplete set of indicator function sufficient statistics. Does using an even more redundant set of sufficient statistics lead to the existence of more reparameterizations and hence more local optima? The Bethe approximation does not give a lower or upper bound in general, unlike the mean field approximation, but in certain cases it can be convexified (Wainwright [246]) to give an upper bound, and in some models it is a lower bound (Sudderth [224]).
Clarified definition of hypergraphs, and noted that edges in hypergraph diagrams (e.g. Fig. 4.4) denote subset inclusion and are used for generalized belief propagation, they are not hyperedges themselves.
Using 4.41, we can see 4.42 can be put into a similar form as the distribution given by the junction tree, 2.12 on p32, by multiplying and dividing by \phi_h as necessary to get the marginals \mu_h for the maximal hyperedges in the numerator and \mu_h for the intersections between maximal hyperedges (separator sets in the junction tree) in the denominator.
Be able to derive 4.45 from 4.44 and 4.42.
Generalized BP is to the Kikuchi approximation as ordinary sum product is to the Bethe approximation. Kikuchi gives a better approximation to both the set of realizable mean parameters M and to the conjugate dual (negative entropy) A*.
On p103, in the center equation array, middle line, \omega(e,f) should be \omega(f,e).
The right hand side of Equation 4.17 should be reversed, \sum \tau_s - 1, to be consistent with a positive \lambda_SS in Equation 4.21a.
Definition of cumulant? Be able to prove second derivative of A as equation 3.41b in Proposition 3.1. Why does positive definite Hessian mean function is convex? Why is the gradient map \nabla A onto the interior of M, even for discrete random variables?
Discussed mean parameterization, the set of realizable mean parameters M, the example of M for a Gaussian, the convexity of M, M as a convex hull for discrete random variables, and the equivalent representation as the intersection of a finite collection of half-spaces.
Reviewed exponential family basics, including regular families (families for which the domain Omega is an open set) and overcomplete, using the example of a multinomial distribution. Covered details on the Ising model, its extension to multinomial variables, and Gaussian MRF (focusing on Theta as the negative definite precision matrix), and the dimensions of such families.
Clarified definitions of clique tree and junction tree. Showed how to derive 2.12 for the simple example of a directed tree. Proved 3.4 using Lagrange multipliers, from Cover and Thomas, Information Theory, Chapter 11. Digression on definition of sufficient statistics. Why "is this distribution a member of the exponential family" is not the right question to ask.
Worked out example of sum-product to calculate marginal at one node. Noted that \sum_i \sum_j i j = (\sum_i i) (sum_j j). Looked at extension of sum-product to simultaneously compute marginals at all nodes using 2 |E| messages. Noted that max can also be used in place of sum, and in fact, updates can be done on any commutative semirings, of which sum-product and max-product are specific examples.
Looked at junction tree algorithm. Noted that, using the terminology in the paper, a junction tree is a clique tree that satisfies the running intersection property. A clique tree can be constructed for any graph, but a junction tree can only be built on a triangulated graph. The nodes in a junction tree correspond to the maximal cliques of the triangulated graph, and the edges are set using maximum spanning tree, where the weight on an edge is the size of the separator set for the two nodes connected by the edge. Potentials are kept not only over the nodes in the junction tree but also the separator sets, and messages are sent by dividing by the stored potential in the separator set in order to avoid double-counting information. Worked through Example 2.2.
Specific issues covered - directed and undirected models, parameterizing undirected models using maximal cliques and non-maximal cliques, factor graphs, converting between directed, undirected models, and factor graphs, conditional independence, the need for moralization when converting from directed to undirected, conditional independence assumptions that can only be modeled by either directed or undirected models, explaining away, variable elimination, sum-product algorithm on trees, directed trees versus polytrees.
The main point of confusion was the exact difference between Product of Experts and Field of Experts. There were two aspects to this confusion. The first was whether or not Field of Experts was simply Product of Experts trained with overlapping patches. This is not the case, though it might appear so looking at equation 12. The key is that you have one normalizing factor Z for each entire image (that doesn't factor over each patch), whereas in Product of Experts you have one normalizing factor for each patch example.
The second point was the rationale behind coupling the patches this way. Although during training, one could pick out independent patches, when applying the model to an entire image, overlapping patches will be dependent, and one would want this dependence to be captured during the training process. The experiment results in section 5.3.5 validate this reasoning.
Why use log loss and Bayes Risk rather than 0-1 loss and Bayes Error? Then again, why use 0-1 loss rather than log loss?
Some confusion of r( ). r is a partitioning, for generative and fully discriminative models, there is only one partitioning, but for pseudolikelihood discriminative, there is one r_i for each node in the graphical model. For the generative model, there is only one partition that encompasses the entire outcome space Z. For the fully discriminative model, there is one partition for each data point x, and encompasses all possible labels Y. By maximizing the quantity (3), we are moving probability mass within a neighborhood r(z) and putting it on the point z. Introducing the r notation allows the generative, discriminative, and pseudolikelihood method to be discussed under one framework.
The take-away messages are Corollaries 1 and 2, and that discriminative models enjoy a O(n^{-1}) convergence in risk even when the model is misspecified, unlike the generative and pseudolikelihood methods, which matches the observation in Wainwright (2006) that one should use the same inference procedure in both training and testing.
What is the difference between the RBM and the Autoassociator Network? Both seem quite similar, especially with the constraint on the Autoassociator Network that the encoding functions are sigmoids and W^T = W^*, and training the RBM using one step contrastive divergence. However, there appear to be important differences, arising from the fact that the RBM is a generative probabilistic model and the Autoassociator Network is not.
In particular, the best approach seems to be to apply some linear combination of both the updates for the RBM and the Autoassociator Network, as discussed on p26. One hypothesis is that "there is no guarantee that the RBM will encode in its hidden representation all the information in the input vector, [while] an autoassociator is trying to achieve this".
This relates to the comment on p17 that, if, at some point in a deep belief network, the representation made of units at the highest level are mostly independent, then the next level would have no statistical structure to learn, and thus initialize the weights of the new RBM to zero (and appropriate set the bias weights for the penultimate layer). Conceivably, the RBM could also instead set very strong weights connecting a node in the penultimate layer only to the node in the top layer that is directly above it (a one to one connection). However, we are not sure which the RBM would actual do, using the training scheme given in the paper, and moreover, there is the further question of whether most of the information in the DBN would be contained in bias weights at intermediary levels of the DBN, which would not be as useful when constructing a classifier using the top layer of the DBN as input.
There was some confusion about independence versus correlation. Independence means p(a,b) = p(a)p(b), while two random variables are uncorrelated if E[AB] = E[A]E[B]. It's fairly easy to show that independence implies two variables are uncorrelated, but the reverse is not true. One simple example is X = -1, 0, 1, each with 1/3 probability, and Y = 1 if X = 0, otherwise Y = 0. Then X and Y are uncorrelated but clearly not independent.
There was some confusion also about PCA. Here is one summary [16].
One other question was how to go from the minimization (1) to the minimization (2), any why the eigenvectors are the solution to the relaxed version of (2). If you expand the ||y_i - y_j||^2 terms into y_i^2 + y_j^2 + 2 y_i y_j for each bit, you should be able to rearrange terms to get it to be 2 times the objective function of (2).
Here are two sets of lecture notes related to this [17] and [18]. These course notes deal with graphs and their Lapalcians (the D-W term), which may help in understanding part of the paper. In particular, this set of notes [19] contains a proof of the Courant-Fischer theorem, which proves that the minimal value of the relaxed problem (2) is the minimal eigenvalue.
What is intelligence? Is it behavior, or memory, or something else? General agreement that the Serle Chinese room argument is flawed. If intelligence is behavior, then is the Turing test an appropriate measure of intelligence? Does the fact that a system like ELIZA [21] can apparently fool some laypeople into believing that they are talking to an actual person, despite not having what most AI practitioners would consider real intelligence, indicate a flaw in the Turing test?
If intelligence is memory, then is something like Torralba's 80 Million Tiny Images [22] the path to AI? Can general AI really be solved by such an approach? Can such an approach provide an interpretation of a conference room, handling the variety of possible chairs, tables, faces, viewing angles, etc?
What will a general AI solution require? Hawkins argues that any approach must have a time component, mimic the physical architecture of the human brain, and have many feedback connections. What else will have to be a part of any general AI solution? Higher-order feature induction?
Fall 2008:
Spring 2007:
Previous meetings:
Other interesting articles/links