Well, it is not just end of the year, but end of the **decade **as well. So, for the last article of 2019, we decided to write about **awarded **research **papers **from **Neural Information Processing Systems** (*NeurIPS*) conference. Looking at the logo of *NeurIPS * conference, my wife asked me “What is that? Is that a witchcraft conference?”. Close, but no, it is one of the most **important** machine learning conferences. In fact, this seems like appropriate *end of the decade *blog post, since *NeurIPS *awards is something like the *Oscars *in the world of machine learning. Every year a bunch of papers are proposed to and best papers are awarded. This was thirty-third *NeurIPS *conference and it was held in Vancouver from December 8th to 14th, 2019. Interesting thing is that this years’ conference had biggest number of **submissions **ever. Record-breaking 6743 submissions this year were made this year, which says a lot about the state of machine learning industry. In fact field grew so much, that tickets for this year conference were sold out in under **twelve **minutes. I know, mind blowing.

From huge number of submitted papers, 1428 were accepted and 3 were awarded. In general there are three **categories**:

*Outstanding Paper Award*– Best research paper of the conference*Outstanding New Directions Paper Award*– Research paper that is setting a path for future research.*Test of Time Award*– Research paper that was published 10 years ago at NeurIPS and that had a lasting impact on our community.

The *NeurIPS *committee was guided by several **criteria**. The best paper has to be revolutionary, creative and have certain elegance, but it has feasible, realistic and reproducible as well. It also shouldn’t be over complicated and inefficient. In our opinion committee have done an awesome job ðŸ™‚

## Outstanding Paper Award

### Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Now this is one amazing paper! What strikes us the most is how this paper proposes an **elegant **new approach to the **old **problem. In a nutshell, this paper attacks one of the most influential machine learning problems – the problem of learning an unknown **halfspace**. Or to be more precise, it focuses on an algorithm for learning halfspaces in the distribution-independent *PAC *model with *Massart Noise*. Let’s decipher this a little bit. Halfspaces are functions that separate two **classes**, positive samples and negative samples, by a hyperplane. Basically, binary classification. They are also called *Linear Threshold Functions* (**LTFs**), are can be represented like this:

where *sign(u) = 1* if *u â‰¥ 0* and *sign(u) = âˆ’1* if *u < 0*, *w* are the weights and *x* are the features. In a nutshell, they are **boolean **functions that separate data into two spaces. If you want to observe this from the **deep learning** perspective it is the problem that *Rosenblattâ€™s Perceptron* tried to solve as well. The main issue here is that if data is **corrupted**, the result depends on the underlying noise model.

One of the algorithms that attack this problem of binary classification is **probably approximately correct** (*PAC*) learning. This model analyzes whether and under what conditions a learning agent will output an approximately correct **classification**. In it’s unsupervised, non-parametric statistical technique primarily used for dimensionality reduction.*Massart Noise* extends this approach. This is achieved by **flipping **the label of each sample/record with some small probability that is unknown to the learning agent. Will the label be flipped or not is decided by factor *n*. This factor has value smaller than *1/2*. In this research, a polynomial-time of *1/Îµ* was proved with an excess risk equal to the *Massart Noise* level plus *Îµ*.

Read the complete paper **here**.

Honorable mentions:

- Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses
- Fast and Accurate Least-Mean-Squares Solvers

## Outstanding New Directions Paper Award

### Uniform Convergence may be Unable to Explain Generalization in Deep Learning

As you know, we at **Rubik’s Code** love **deep learning**, so this paper straight up blew us away. We are facing that neural networks are today used in different industries for variety of problems. However, that was not always the **case**. In fact, a lot of industries are still **skeptical **about deep learning. They prefer standard machine learning models because they are **explainable**. And they have a good reason for that. Questions like “Why over-parameterized neural networks generalize well?” are still **opened**. How neural networks can perform well on data that was never seen, after they are trained on large real training dataset?

Variety of **generalization bounds** for **neural networks** has been developed for that particular reason. A generalization bound is a statement about the **predictive **performance of a learning algorithm, in this case neural network. Basically, neural network is observed as a **procedure **that takes some finite training data as input and returns a predicted labels on new data. Since we assume that all data, both training and evaluation, has fixed **distribution**, the quality of mentioned predictions can be measured in terms of its **risk**. This means that predictions are compared with the distribution of data, and risk represents level of it’s **incompatibility**. To sum it up, a generalization bound is a probabilistic bound on the **defect**.

Majority of generalization bounds are based on **uniform convergence**, which can be defined like this.

A sequence of functions

converges uniformly to a limiting functionfnon a setfif, given any arbitrarily small positive numberE, a numberÎµcan be found such that each of the functionsN,fn,fn+1, etc. differ fromfn+2by no more thanfat every pointÎµinx.E

Now, this paper challenges this assumption. It proposes set of experiments that proves how uniform convergence **can’t explain** generalization in **deep learning**. The **experiments **are done on *MNIST* dataset with three over-paramterized models and different hyperparameter settings for different training set sizes is tested. All models are trained on the *Stohastic Gradient Descend*. To be more precise, overparameterized **neural network** with only one hidden layer (with 100k neurons) is trained using *Stohastic Gradient Descend* on dataset with 100 dimensions. Now, if we increase training dataset size test error **decreases **and generalization **improves**. However, this paper proves that the decision boundary is not simple and that uniform convergence **increases **the bounds when training size is **increased**. This means that uniform convergence can’t completely explain the generalization and that we should develop **algorithm-independent** techniques that donâ€™t restrict themselves.

Read the complete paper **here**.

Honorable mentions:

- Putting An End to End-to-End: Gradient-Isolated Learning of Representations
- Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

## Test of Time

### Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Final award went to the paper published on *NeurIPS* that stood test of time and left **enduring impact **on machine learning community. Basically, committee overviews a list of accepted papers to *NeurIPS *ten years ago, that have had the most citations. This year that was a paper by* Lin Xiao*, who’s research explores the **fundamental concepts** of modern machine learning. This paper proposed a *Regularised Dual Averaging Method *(**RDA**), an optimization technique used for solving **online convex optimization** problems. *Online convex optimization* has goal same as *Stochastic Gradient Descent* – to minimize loss, however it is performed differently. In essence, it is simulated as a game where the player at each timestampÂ *t*, **predicts **a weight vector and the loss.

This approach had many **problems **before this paper. Probably the biggest impact that this paper had is the optimization method – *Batch Optimization.* Meaning, that initially only **fraction **of samples is available. Then the player calculates weight vector in time step *t*. Once this is done, loss is calculated with sub-gradient based on current weights. This is repeated in the next time step *t+1.*

Read the complete paper **here**.

## Conclusion

In this article, we explored the most interesting papers from NeurIPS Conference. In our humble opinion, they will shake the world of machine learning in the years to come.

Thanks for reading!

Read more posts from the author at **Rubikâ€™s Code**.

## Trackbacks/Pingbacks