Last update: August 13. More answers will follow.

Questions Asked by Attendees

Q: Have you tried images not in your database at all?
A: Yes, usually for detection. For example, I used a trained network to detect objects in images downloaded from the internet.

Q: I meant test images not in your testing database.
A: Please see the previous answer.

Q: At the 2nd convolutional layer, do you share weights between different feature maps?
A: Yes.

Q: Hi, in a max-pooling layer, is the maximum taken without considering the sign of the outputs?
A: The sign IS considered.

Q: Why?
A: Not considering the sign prevents the network from training at all. It is irrelevant whether one uses the maximum or the minimum, as long as this is consistent during training and testing.

Q: What about detection of morphing images in real-time?
A: I do not understand the question. Is it about testing distorted images? It can be done, but I do not see the use case.

Q: Is my question going to be answered or disregarded? Or has the answer already been provided?
A: Unfortunately, Dan can't answer all the questions that have come in. If your question is not answered during the presentation, Dan will answer it offline via email.

Q: Why can't an MLP be used for classifying images if the image is supplied as a vector?
A: An MLP can be used for classifying images, but it is not very good at this task compared with a dedicated architecture (like a CNN). An MLP does not use the information about which pixels are vertical neighbors.

Q: Is there a possibility to use DNN for medical image segmentation in real time?
A: Yes, as long as there is enough computing power, like a machine with several GPUs. It all depends on the required frame rate, the size of the DNN and the dimensions of the input images.

Q: What was the reason for the improvement that you show over previous algorithms? Was it simply a deeper network or was there something more clever involved?
A: Several reasons: the DNN is both deeper (more layers) and wider (the layers are larger) than previous models.
I think SGD (stochastic gradient descent) also helped. Most nets are trained in batch mode because it is faster. On-the-fly distortions during training are very important for generalization if the training set is not big enough. Averaging several independently trained networks further improves recognition by up to 30%.

Q: What is the bias?
A: There is a bias neuron set to a fixed value of one. Its outgoing weights are trained, i.e. they permit a full affine transformation of the input vector. Most importantly, the bias term is used as a variable threshold.

Q: On GPU do the nets use double precision?
A: Single precision is enough. I only use double precision for the output layer because there are log functions which are sensitive to precision.

Q: Were any stability/computation issues introduced by differences in IEEE 754 rounding between CPU and GPU?
A: They were minor, at the limit of floating point precision.

Q: Should the size of your first fully connected layer equal the unrolled size of your last pooling layer?
A: I never considered this, but I do not see why it should have the same size.

Q: How can we deduce the number of layers / net architecture from the input image size and the number of categories?
A: One heuristic I use: deep nets are better than shallow nets => we need more layers (~10) => kernel sizes have to be small. Experimenting with several NN architectures on the validation data also helps. Usually, most NNs perform similarly, unless they are too small (shallow or not wide enough).

Q: When averaging the results of the committee, is that a simple average or something else, such as picking the results within one standard deviation? This might require a larger committee?
A: I tried several types of averaging, including training multiple nets at the same time, but the simplest averaging was best.

Q: What is the key insight that resulted in such a dramatic improvement in performance?
A: Deep and wide nets trained with SGD on distorted images.
Averaging independently trained nets.

Q: LeCun observed that for convnets normalization is key to good performance. Do you use a nonlinearity+normalization step?
A: The nonlinearity is in the activation function (tanh). I do not use normalization inside the net.

Q: Was the brain segmentation network trained on hand-labelled data?
A: Yes. More details can be found in my NIPS 2012 paper or on the competition's website.

Q: Which criterion is used to select which outputs of the first convolutional layer are mixed to get the inputs of the second convolutional layer, and so on?
A: I had the same question when I started to work with CNNs and I tried various ways of connecting the layers, including random ones. In the end I decided full connection is probably close to optimal.

Q: Dan, how do we make sure the (max-pooling) neuron is really the max? There might be a case where the 1st max and the 2nd max are quite near. Can we choose 2 instead of just the max?
A: This can be a problem. I do not do anything special in this case; the first maximum (if there are several neurons with exactly the same output) is used. I tried backpropagating to the highest n values (usually n=2 or n=3) but I could not see an improvement in the limited experiments I did.

Q: If you have new classes of objects coming into the training dataset, do you have to retrain the entire neural net or just the top layers?
A: I have a paper about transfer learning at IJCNN 2011. You can find the PDF on my webpage. The paper contains extensive tests.

Q: How do you determine the best CNN structure (number and size of layers etc.)?
A: From experience I know that a CNN performs best if it has at least 2-3 stages of convolutional and max-pooling layers. The MLP at the end should have at least one hidden layer in addition to the output layer. A NN should not have more than 10-12 layers because there will be problems with single precision floating point computation.
Most architectures will perform similarly, thus fine-tuning is necessary only when trying to get the maximum achievable performance.

Q: How is the size of a max-pooling or a convolution kernel chosen?
A: Once we know the input size we try to build a net with 6-12 layers. Usually, the convolutional filters are small, 2x2 to 5x5, but they can be bigger if the input image is big enough. Max-pooling kernels are 2x2 most of the time. If the input image is >=100, then bigger kernels can be used.

Q: How is the number and type of hidden layers chosen?
A: The number of layers is between 6 and 12.

Q: What are the training differences between an MLP and a CNN? I understand backpropagation with an MLP network. How are the 3x3 weights for the filters in the CNN selected? Does it use backpropagation or are these weights chosen using some other method?
A: The training algorithm is exactly the same. Details: on a max-pooling layer the error is backpropagated only to the neuron that won during forward propagation; the shared weights in a convolutional layer are updated for every connection in which they are used. The 3x3 (or whatever size) filters are randomly initialized and trained with SGD (see IJCAI 2011).

Q: Once the DNN is trained, how does the computational cost of prediction (for example when testing) compare to a linear SVM? Is it much higher?
A: I do not know. On a GTX 580 with a medium-size CNN one can test 2000-10000 digits/s. On a CPU with very optimized code, 200-1000 digits/s.

Q: What software or program do you recommend for CNN?
A: I know there are several libraries but I have never tried them, thus I cannot recommend one. When I started writing my code there was no library, and I had to write my own.

Q: How do you come up with the filter description?
A: I do not know what a filter description is. If it is about filter size, I mostly use small filter sizes, like 2x2-5x5.

Q: What postprocessing did you do for the medical data?
A: For neuronal membrane segmentation we only averaged the output of several nets, then smoothed the result (see NIPS 2012). For mitosis detection we used non-maxima suppression (see MICCAI 2013).

Q: How do you average NNs with different numbers of parameters? The parameter matrices don't even have the same dimensions.
A: I do not average the internal parts of the nets. Only the outputs (predictions) are averaged. You can even average different methods as long as the outputs have the same meaning and scale.

Q: Was the performance of the DNN (specificity/sensitivity) compared to that of other classification models like Random Forests for a similar computational effort?
A: I do not have a definitive answer for this because I do not have fast code for other methods.

Q: Although CIFAR doesn't allow for it, would recognition be improved with object rotation sequences and temporal associativity such as trace rule as booster?
A: --unanswered--

Q: What technologies/languages do you consider to be the most useful for developing Deep Neural Networks nowadays?
A: I do not think there is only one answer here. I use C/C++/CUDA.

Q: Do you believe Deep Neural Networks could be the best option for recognizing 3D images?
A: Yes, as long as you have a big enough labeled training set.

Q: Can you show us some results on biomedical image segmentation?
A: Please have a look at the dedicated slides, or at my NIPS 2012 and MICCAI 2013 papers.

Q: Can you please remind us of the web site from which we can download the slides?
A: The recording and a video of Dan's slides will be available at http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php. We will send you an email with the link in a few days.

Q: What do you feel are the most significant innovations in CNNs since your 2011/2012 work?
A: --unanswered--

Q: Has max pooling been shown to work better than other types of pooling, such as average pooling? Or are they comparable?
A: I have tried many types of pooling operations, like average, absolute and vertical pooling. Max-pooling always worked best.

Q: How different are the model setups for different applications (e.g. detecting different types of characters)?
A: If the input size is the same, then the network can be used for different character sets or even other types of images.

Q: Have you looked into training models that don't fit on a single GPU?
A: Yes, but I haven't published any results yet.

Q: How will new GPU features (e.g. unified memory) impact deep learning on GPUs?
A: Unified memory is mostly about making code easier to write, not making it faster.

Q: How do you select the network architecture? Is it simply trial-and-error or is there a more systematic approach?
A: --unanswered--

Q: When training a deep neural network, are there any experiences that you can share with us on selecting hyperparameters such as the learning rate?
A: --unanswered--

Q: Also, when training a deep neural network, even when the network is large enough, there is sometimes high training error (I guess due to local minima). How do you deal with that?
A: --unanswered--

Q: Where does the +1 weight come from?
A: It is for connecting with the bias neuron.

Q: Could you tell us how you were able to find the bug in the Chinese character competition?
A: --unanswered--

Q: How do you calculate the size of the different layers in the CNN after you set the size of the first layer?
A: --unanswered--

Q: Are there any drawbacks other than computation speed with deep convolutional neural networks?
A: Considering they are state of the art on so many datasets, we can safely assume they have the smallest number of drawbacks.

Q: Is max pooling the best pooling? What other pooling strategies have you explored?
A: I have tried many types of pooling operations, like average, absolute and vertical pooling. Max-pooling always worked best.

Q: What was the output encoding used in your NN for the ICDAR 2011 and 2013 competitions?
What would you recommend for encoding output with many thousands of discrete classes?
A: 1 for the correct label, 0 for incorrect labels. It worked with ~4000 classes.

Q: I'm curious for more details about why and how you utilized four GPUs simultaneously.
A: Each GPU runs a different net, thus they are completely independent.

Q: Regarding three input maps: for a given neuron that they project to, is it the same exact convolutional kernel being applied to all three maps?
A: --unanswered--

Q: Can you provide the link?
A: The link to the recording of this webinar will be located at http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php in a few days. We will email everyone once it is live.

Q: Can a NN be used to retain (keep in memory) a lot of shapes (for RS or GIS processes) and then be used with another data set to classify the features from, let's say, a second image, or does the NN need to be trained with the new dataset?
A: --unanswered--

Q: How are you implementing the even kernel sizes (e.g. 2x2)? What "center pixel" are you using? Top, bottom...?
A: --unanswered--

Q: Is a CNN a method of learning a set of non-stationary kernels?
A: --unanswered--

Q: How do we validate that a CNN has learned an optimized number of kernels?
A: --unanswered--

Q: Were the traffic signs already localized or did you have to detect them first?
A: The competition was about classification. Please see the GTSRB competition or my IJCNN 2011 & NN 2012 papers.

Q: For the Chinese character recognition, how did you avoid overlearning given only 270 samples per class?
A: Overlearning was not a problem for the Chinese character recognition.

Q: (How many features were propagated to the last layer of the NN?)
A: --unanswered--

Q: Can you derive any meaning from the features that propagate to the last layer of the NN?
A: --unanswered--

Q: What are the different ways to pre-process the input data before running the CNN? Or is it the case that we apply the CNN directly on the RGB input?
What are the general best practices when working with CNNs?
A: I use the raw RGB data. Sometimes contrast normalization helps.

Q: For the various competitions you applied DNNs to, did you use the same DNN architecture or did it vary between competitions?
A: The architecture varied because the tasks were different and had different image sizes.

Q: Have you explored FPGAs for DNNs? What's your opinion on this?
A: By definition FPGAs are slower than ASICs. A GPU is mostly a computing machine, thus almost all transistors are used for computation. Now that there is K1, FPGAs are probably useful only for very compact systems.

Q: Have you considered using temporal dependencies of input data, e.g. for video data?
A: --unanswered--

Q: Have you explored using fixed point representation? How does it impact the DNN's performance?
A: --unanswered--

Q: Do you think Deep Neural Networks can be extended to perform feature extraction and selection in an image retrieval pipeline?
A: --unanswered--

Q: Can you give the CPU type and config?
A: i7, one thread.

Q: The CPU config type for the 27 hour computation?
A: i7, one thread.

Q: Do you think GPUs are the best choice for this work? How about the Intel MIC?
A: Their architecture is very different and much more flexible. They are probably comparable with GPUs, at least for some ML algorithms. For CNNs, MIC might not be the best solution because it is mostly a cluster packed in a chip, thus there is still much more communication latency compared with a GPU.

Q: Would you expect, for medical tasks, that full analysis of 3D data can help us make classifiers that can go through a full body MRI scan and classify it as "healthy" or "see a doctor"?
A: Probably yes.

Q: Have you tried, or do you know if anyone has tried, to use CNNs in recurrent networks? I mean, CNNs are usually used for fixed-size input images. But what if the images have different sizes? Have you tried to combine CNN with RNN, like BLSTMs, for instance?
A: I remember reading about recurrent CNNs, but I have no clear pointer for you. There are several methods that can be used when the images have different resolutions. The simplest one is to rescale everything to a fixed resolution. Sermanet and LeCun have papers on multi-scale CNNs, although I think that happens inside the net, not at the input.

Q: Is the size of the pooling (i.e. 2x2 or 3x3, etc.) fixed during construction or is it learned?
A: It is fixed before training.

Q: What about running NN on Hadoop clusters? Is that considered bad practice?
A: --unanswered--

Q: Would adding distortions help in Chinese character recognition?
A: I already used distortions, but they did not help.

Q: Have you tried unsupervised learning to help extract representations of the raw inputs, such as stacked RBMs?
A: Yes, several years ago on MNIST, and I didn't get any improvement.

Q: How do you decide on the number of layers and the number of neurons to use? Is there any general rule?
A: --unanswered--

Q: Besides ensembles and adding distortions, what other tricks do you have to improve a NN's performance?
A: These two are by far the most important.

Q: Would using videos (with temporal info) instead of images potentially improve performance for this kind of task?
A: Probably. I have some work in progress with this type of data.

Q: Hi Dan, is there any plan to make the CUDA implementation open source? And if I want to learn more about how to implement CNN on CUDA, could you point me to some resources to start with?
A: There are several libraries (Torch, Theano) that can be used to build a CNN. I have a start-up and I cannot share my code.

Q: Could you elaborate on the initialization of the filters for CNN?
A: Random with values from -0.05 to 0.05. You can use a different interval if the network does not train properly.

Q: How important is it to the convergence of the training?
A: The initial values and the learning rate should match.
If the weights are very small and the learning rate too big, the net will not learn.

Q: When doing max pooling, is there superposition of the pooling filter?
A: One can use superposition.

Q: For the retinal vessel segmentation, could you indicate the sensitivity, specificity and accuracy?
A: That paper is under review.

Q: How do you find the number of epochs?
A: Usually I use a validation set and I stop when the error does not decrease anymore.

Q: How do you regularize your network?
A: The net is trained with plain SGD. There are no regularization layers.

Q: Have you seen any implementation of CNNs using CUDA 6 XT Libraries? (In theory almost perfect scaling on multi-GPU.)
A: No. cuBLAS is not very good for CNNs because one has to first unroll the convolutions, then use cuBLAS for the matrix operations.

Q: Once trained, how long does it take for a DNN to classify objects?
A: Using one GTX 580 GPU, depending on the net or model (multiple nets), from 10 to 100000 images/s.

Q: People use Bag of Words for image classification, and in that they use histograms as the representation. That also loses 2D information and yet it performs well. Do you think that a DNN would work well if it takes histograms as input?
A: --unanswered--

Q: What would be the effect of using Kohonen's SOM for the initial convolution to detect features?
A: I do not know. If you try it, please let me know if there is something interesting.

Q: Do you always use 3x3 convolutions? Or are larger windows sometimes used?
A: Most of the time I use small convolution kernels because I want a deep net. If the input image is big, one can use even 20x20 convolution kernels. Training will then be much slower.

Q: Follow up to the 3x3 window question... I see now that you are sometimes using 4x4 or 5x5 in the convolution stages (interesting that it's not always the earlier stage using larger windows...). Are the kernel sizes also learned? If not, what is the intuition behind selecting a kernel size at a given layer?
A: No, they are not learned. Intuition: use the smallest kernel size that produces a map with even dimensions, as required by 2x2 max pooling. Bigger max-pooling kernels decrease the map size too fast and usually do not allow for a deep enough net. I get the question of how I choose the net architecture all the time, including in all reviews of my papers. There is no magic there; the algorithm works well with a broad range of values for the net's parameters.

Q: About traffic signs: what kind of color filter did you use?
A: The filters are randomly initialized and learned during training.

Q: Dan, thank you very much for the lecture. I have to go to feed my kids now, but I'll certainly hunt out the copy of the slides. Many thanks.
A: --unanswered--

Q: Are any of these trained weights & nets available publicly?
A: No, but if you need some and I still have them I can share them with you.

Q: Where did all the diagonal forms of the matrices go? That's most of what I remember from OCR in the late 80s.
A: I do not know what you are referring to.

Q: Why don't you preprocess the input to extract eigenvalues to reduce dimensions?
A: I tried that too. The problem is deciding how many of them to extract. In the end, it is better to let the training algorithm extract whatever information is most relevant from the training set. Also, if eigenvalues are extracted, then a CNN cannot be used anymore. One would have to reconstruct the images from the eigenvalues. On MNIST this didn't work better than using the original images.

Q: How sensitive to color and lighting variations are these methods?
A: If the training set contains enough variation, the methods are robust to color and lighting variation. One can even use color/lighting distortions on the input images.

Q: Do you calculate the Jacobian explicitly or use an approximation? What is the impact on convergence and runtime?
A: SGD does not require computing the Jacobian.
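
Editor's sketch: the kernel-size intuition given above (pick the smallest kernel that leaves an even map size for 2x2 max pooling) can be illustrated with a few lines of Python. This is only a rough sketch, not Dan's code. It assumes square inputs, "valid" convolutions with stride 1 (output size = input size - kernel + 1) and non-overlapping 2x2 pooling; the function name `plan_stages` and the 2 to 5 kernel range are hypothetical choices made for the illustration.

```python
def plan_stages(input_size, min_kernel=2, max_kernel=5):
    """Hypothetical helper: list (kernel, conv_out, pooled) for each conv + 2x2 max-pool stage.

    Assumes square maps, valid convolution with stride 1 (out = size - kernel + 1)
    and non-overlapping 2x2 max pooling (out = size // 2).
    """
    stages = []
    size = input_size
    while True:
        # Smallest kernel whose convolution output is even and at least 2x2,
        # so that a 2x2 max pooling divides it exactly.
        kernel = next(
            (k for k in range(min_kernel, max_kernel + 1)
             if size - k + 1 >= 2 and (size - k + 1) % 2 == 0),
            None,
        )
        if kernel is None:
            break  # map too small for another conv + pooling stage
        conv_out = size - kernel + 1
        pooled = conv_out // 2
        stages.append((kernel, conv_out, pooled))
        size = pooled
    return stages


if __name__ == "__main__":
    for kernel, conv_out, pooled in plan_stages(48):
        print(f"conv {kernel}x{kernel} -> {conv_out}x{conv_out}, "
              f"pool 2x2 -> {pooled}x{pooled}")
```

Under these assumptions a 48x48 input yields four stages (kernels 3, 2, 2, 2) before the map shrinks to 2x2. A real net would normally stop earlier and hand the remaining maps to the fully connected layers, and the kernel sizes would still be tuned on validation data.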