# Unlocking the Potential of AI

## About MediaTek Research

MediaTek Research is a specialized AI research unit within the Global MediaTek Group. With two state-of-the-art research centers located in Cambridge (UK) and at National Taiwan University, we foster a collaborative environment where we work closely with esteemed institutions and academics worldwide.

Our team comprises accomplished researchers with diverse backgrounds in computer science, engineering, mathematics, and physics. This expertise enables us to approach the most pressing challenges from multiple angles, fostering innovation and cross-disciplinary collaboration to seek both fundamental breakthroughs and practical applications.

### Vision

Our vision is to push the limits of what is possible in Artificial Intelligence (AI) and Machine Learning (ML). We are committed to advancing the field by developing innovative technologies that empower people, while striving to create systems that are genuinely intelligent, ethical, secure, and sustainable.

Our goal is to enable machines to learn, reason, and interact with humans in ways that are natural, intuitive, and beneficial to society. AI should enhance human potential, enabling us to lead happier, healthier, and more fulfilling lives. We believe that by pushing the boundaries of what AI can do, we can unlock new opportunities, discoveries, and progress that will shape our future.

### Research

### Papers

#### Delayed Feedback in Kernel Bandits

MediaTek Research has recently published a paper on delayed feedback in kernel bandits. It considers black-box optimisation of an unknown function from expensive and noisy evaluations, a ubiquitous problem in machine learning, academic research and industrial production.

#### Image generation with shortest path diffusion

MediaTek Research has recently published a paper that investigates image generation with Diffusion Models, which learn to progressively reverse a given image corruption. Recently, a few studies introduced alternative ways of corrupting images in Diffusion Models, with an emphasis on blurring. However, these studies are purely empirical, and it remains unclear what the optimal procedure for corrupting an image is.

#### Uniform Generalization Bounds for Overparameterized Neural Networks

MediaTek Research has recently published a paper that starts from the observation that artificial neural networks achieve favorable generalization errors despite typically being extremely overparameterized. It is well known that classical statistical learning methods often result in vacuous generalization errors in the case of overparameterized neural networks. Adopting the recently developed Neural Tangent (NT) kernel theory, MediaTek Research proves uniform generalization bounds for overparameterized neural networks in kernel regimes, when the true data generating model belongs to the reproducing kernel Hilbert space (RKHS) corresponding to the NT kernel.

#### T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

MediaTek Research has recently published a paper observing that, in Spoken Language Understanding (SLU), a natural solution is to concatenate pre-trained speech models (e.g. HuBERT) and pre-trained language models (PLMs, e.g. T5). Most previous works use pre-trained language models with subword-based tokenization. However, the granularity of input units affects the alignment of speech model outputs and language model inputs, and PLMs with character-based tokenization are underexplored.

#### Hierarchical Representations in Dense Passage Retrieval for Question-Answering

MediaTek Research has recently published a paper that aims to improve question-answering performance by retrieving accompanying information that contains factual evidence matching the question. These retrieved documents are then fed into a reader that generates an answer. A commonly applied retriever is dense passage retrieval. In this retriever, the output of a transformer neural network is used to query a knowledge database for matching documents. Inspired by the observation that different layers of a transformer network provide rich representations with different levels of abstraction, MediaTek Research hypothesizes that useful queries can be generated not only at the output layer, but at every layer of a transformer network, and that the hidden representations of different layers may combine to improve the fetched documents for reader performance. This novel approach integrates retrieval into each layer of a transformer network, exploiting the hierarchical representations of the input question. The paper shows that the proposed technique outperforms prior work on downstream tasks such as question answering, demonstrating its effectiveness.

#### Sample Complexity of Kernel-Based Q-Learning

MediaTek Research has recently published a paper that addresses the enormous state-action spaces that modern Reinforcement Learning (RL) often faces. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, MediaTek Research derives sample complexities for kernel-based Q-learning when a generative model exists.

#### Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods, and Results

MediaTek Research has recently published a paper that presents the multilingual language model BLOOM-zh that features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models, presented by BigScience in 2022. Starting from released models, the team extended the pre-training of BLOOM by an additional 7.4 billion tokens in Traditional Chinese and English covering a variety of domains such as news articles, books, encyclopedias, educational materials as well as spoken language. In order to show the properties of BLOOM-zh, both existing and newly created benchmark scenarios are used for evaluating the performance. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability.

#### Protecting Sensitive Attributes by Adversarial Training Through Class-Overlapping Techniques

In recent years, machine learning as a service (MLaaS) has brought considerable convenience to our daily lives. However, these services raise the issue of leaking users’ sensitive attributes, such as race, when provided through the cloud.

#### Fisher-Legendre (FishLeg) optimization of deep neural networks

MediaTek Research has recently published a paper observing that incorporating second-order gradient information (curvature) into optimization can dramatically reduce the number of iterations required to train machine learning models. In natural gradient descent, such information comes from the Fisher information matrix, which yields a number of desirable properties. As exact natural gradient updates are intractable for large models, successful methods such as KFAC and its sequels approximate the Fisher in a structured form that can easily be inverted. However, this requires model/layer-specific tensor algebra and certain approximations that are often difficult to justify.

#### A Learning-Based Algorithm for Early Floorplan with Flexible Blocks

This paper presents a learning-based algorithm that uses a graph neural network (GNN) and a deconvolution network to predict the locations and aspect ratios of design blocks with flexible rectangles. With several hours of training on 4 GPUs, the proposed method, which aims to minimize wirelength cost, can generate early-stage floorplan placements that are superior to manual placements requiring several days of effort from physical design experts.

#### Gradient Descent: Robustness to Adversarial Corruption

Optimization using gradient descent (GD) is a ubiquitous practice in various machine learning problems including training large neural networks. Noise-free GD and stochastic GD (corrupted by random noise) have been extensively studied in the literature, but less attention has been paid to an adversarial setting in which the gradient values are subject to adversarial corruptions. In this work, we analyze the performance of GD under a proposed general adversarial framework. For the class of functions satisfying the Polyak-Łojasiewicz condition, we derive finite time bounds on a minimax optimization error. Based on this bound, we provide a guideline on the choice of learning rate sequence with theoretical guarantees on the robustness of GD against adversarial corruption.
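As an illustration of the setting only, and not the paper's framework or bounds, the sketch below runs GD on a toy quadratic that satisfies the Polyak-Łojasiewicz condition while a hypothetical adversary adds a perturbation of fixed norm `eps` to every gradient. The function names, the adversary, and all constants are assumptions made for this example.

```python
import numpy as np

def corrupted_gd(grad, x0, lr, corruption, steps):
    """Gradient descent where every gradient is perturbed by an adversary."""
    x = np.asarray(x0, dtype=float)
    for t in range(steps):
        g = grad(x) + corruption(t, x)  # the learner only sees the corrupted gradient
        x = x - lr * g
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x and whose minimizer is 0.
def grad(x):
    return x

eps = 0.1  # hypothetical per-step corruption budget (norm of the perturbation)

def adversary(t, x):
    """Push the update away from the minimizer with a perturbation of norm eps."""
    norm = np.linalg.norm(x)
    return -eps * x / norm if norm > 0 else np.zeros_like(x)

x_final = corrupted_gd(grad, x0=[5.0, -3.0], lr=0.1, corruption=adversary, steps=500)
final_norm = float(np.linalg.norm(x_final))
```

In this toy run the distance to the minimizer contracts towards `eps` rather than zero, which is the kind of corruption-dependent error floor that a robustness analysis has to account for.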

#### Near-Optimal Collaborative Learning in Bandits

This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify (in pure exploration) or play (in regret minimization) its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization [Shi et al., 2021]. In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.

#### Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning

Kernel-based models such as kernel ridge regression and Gaussian processes are ubiquitous in machine learning applications for regression and optimization. It is well known that a major downside for kernel-based models is the high computational cost; given a dataset of n samples, the cost grows as O(n^3). Existing sparse approximation methods can yield a significant reduction in the computational cost, effectively reducing the actual cost down to as low as O(n) in certain cases. Despite this remarkable empirical success, significant gaps remain in the existing results for the analytical bounds on the error due to approximation. In this work, we provide novel confidence intervals for the Nyström method and the sparse variational Gaussian process approximation method, which we establish using novel interpretations of the approximate (surrogate) posterior variance of the models. Our confidence intervals lead to improved performance bounds in both regression and optimization problems.
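To make the cost reduction concrete, here is a minimal sketch of a Nyström-style kernel ridge regression predictor that works with m landmark points instead of the full n-by-n kernel matrix. It illustrates the general technique only and does not reproduce the paper's confidence intervals; the kernel choice, function names, regularisation value and jitter are assumptions for this example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def nystrom_krr_fit(x, y, landmarks, reg=1e-4, jitter=1e-10):
    """Fit weights on m landmarks; the dominant cost is O(n m^2), not O(n^3)."""
    k_nm = rbf_kernel(x, landmarks)          # n x m cross-kernel
    k_mm = rbf_kernel(landmarks, landmarks)  # m x m landmark kernel
    # Solve (K_mn K_nm + reg * K_mm) w = K_mn y; jitter is added for numerical stability.
    system = k_nm.T @ k_nm + reg * k_mm + jitter * np.eye(len(landmarks))
    return np.linalg.solve(system, k_nm.T @ y)

def nystrom_krr_predict(x_new, landmarks, weights):
    """Predict with the m landmark weights."""
    return rbf_kernel(x_new, landmarks) @ weights

# Toy data: n = 200 noise-free samples of a sine wave, m = 10 evenly spaced landmarks.
x = np.linspace(0.0, 2.0 * np.pi, 200)[:, None]
y = np.sin(x[:, 0])
landmarks = np.linspace(0.0, 2.0 * np.pi, 10)[:, None]

weights = nystrom_krr_fit(x, y, landmarks)
max_err = float(np.max(np.abs(nystrom_krr_predict(x, landmarks, weights) - y)))
```

Only an m-by-m linear system is solved, so for fixed m the cost grows linearly in n. How much accuracy is lost by the approximation is exactly the question that confidence intervals of the kind described above are meant to answer.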

#### Regret Bounds for Noise-Free Kernel-Based Bandits

The kernel-based bandit is an extensively studied black-box optimization problem, in which the objective function is assumed to live in a known reproducing kernel Hilbert space. While nearly optimal regret bounds (up to logarithmic factors) are established in the noisy setting, surprisingly, less is known about the noise-free setting (when the exact values of the underlying function are accessible without observation noise). We discuss several upper bounds on regret, none of which seems order optimal, and provide a conjecture on the order optimal regret bound.

#### LPI: Learned Positional Invariances for Transfer of Task Structure and Zero-shot Planning

Real-world tasks often include interactions with the environment where our actions can drastically change the available or desirable long-term outcomes. One formulation of this in the reinforcement learning setting is in terms of non-Markovian rewards. Here the reward function, and thus the available rewards, are themselves history-dependent, and dynamically change given the agent-environment interactions. An important challenge for navigating such environments is to be able to capture the structure of this dynamic reward function, in a way that is interpretable and allows for optimal planning. This structure, in conjunction with the particular task setting at hand, then determines the optimal order in which actions should be executed, or subtasks completed. Planning methods face the challenge of combinatorial explosion if all such orderings need to be evaluated; however, learning invariances inherent in the task structure can alleviate this pressure. Here we propose a solution to this problem by allowing the planning method to recognise task segments where temporal ordering is irrelevant for predicting reward outcomes downstream. To facilitate this, our agent simultaneously learns to segment a task and predict the changing reward function resulting from its actions, while also learning about the permutation invariances in its history that are relevant for this prediction. This dual approach can allow zero-shot or few-shot generalisation for complex, dynamic reinforcement learning tasks.

#### Adaptive Erasure of Spurious Sequences in Sensory Cortical Circuits

Sequential activity reflecting previously experienced temporal sequences is considered a hallmark of learning across cortical areas. However, it is unknown how cortical circuits avoid the converse problem: producing spurious sequences that are not reflecting sequences in their inputs. We develop methods to quantify and study sequentiality in neural responses. We show that recurrent circuit responses generally include spurious sequences, which are specifically prevented in circuits that obey two widely known features of cortical microcircuit organization: Dale’s law and Hebbian connectivity. In particular, spike-timing-dependent plasticity in excitation-inhibition networks leads to an adaptive erasure of spurious sequences. We tested our theory in multielectrode recordings from the visual cortex of awake ferrets. Although responses to natural stimuli were largely non-sequential, responses to artificial stimuli initially included spurious sequences, which diminished over extended exposure. These results reveal an unexpected role for Hebbian experience-dependent plasticity and Dale’s law in sensory cortical circuits.

#### Flexible Multiple-Objective Reinforcement Learning for Chip Placement

Recently, successful applications of reinforcement learning to chip placement have emerged. Pretrained models are necessary to improve efficiency and effectiveness. Currently, the weights of objective metrics (e.g., wirelength, congestion, and timing) are fixed during pretraining. However, fixed-weighed models cannot generate the diversity of placements required for engineers to accommodate changing requirements as they arise. This paper proposes flexible multiple-objective reinforcement learning (MORL) to support objective functions with inference-time variable weights using just a single pretrained model. Our macro placement results show that MORL can generate the Pareto frontier of multiple objectives effectively.

#### SalesBot: Transitioning from Chit-Chat to Task-Oriented Dialogues

Dialogue systems are usually categorized into two types, open-domain and task-oriented. The first one focuses on chatting with users and making them engage in the conversations, where selecting a proper topic to fit the dialogue context is essential for a successful dialogue. The other one focuses on a specific task instead of casual talks, e.g., finding a movie on Friday night, playing a song. These two directions have been studied separately due to their different purposes. However, how to smoothly transition from social chatting to task-oriented dialogues is important for triggering business opportunities, and there are no public data focusing on such scenarios. Hence, this paper focuses on investigating the conversations starting from open-domain social chatting and then gradually transitioning to task-oriented purposes, and releases a large-scale dataset with detailed annotations for encouraging this research direction. To achieve this goal, this paper proposes a framework to automatically generate many dialogues without human involvement, in which any powerful open-domain dialogue generation model can be easily leveraged. The human evaluation shows that our generated dialogue data has a natural flow at a reasonable quality, showing that our released data has great potential to guide future research directions and commercial activities. Furthermore, the released models allow researchers to automatically generate unlimited dialogues in the target scenarios, which can greatly benefit semi-supervised and unsupervised approaches.

#### How to distribute data across tasks for meta-learning?

Meta-learning models transfer the knowledge acquired from previous tasks to quickly learn new ones. They are trained on benchmarks with a fixed number of data points per task. This number is usually arbitrary and it is unknown how it affects performance at testing. Since labelling of data is expensive, finding the optimal allocation of labels across training tasks may reduce costs. Given a fixed budget of labels, should we use a small number of highly labelled tasks, or many tasks with few labels each? Should we allocate more labels to some tasks and fewer to others? We show that: 1) If tasks are homogeneous, there is a uniform optimal allocation, whereby all tasks get the same amount of data; 2) At fixed budget, there is a trade-off between number of tasks and number of data points per task, with a unique and constant optimum; 3) When trained separately, harder tasks should get more data, at the cost of a smaller number of tasks; 4) When training on a mixture of easy and hard tasks, more data should be allocated to easy tasks. Interestingly, neuroscience experiments have shown that human visual skills also transfer better from easy tasks. We prove these results mathematically on mixed linear regression, and we show empirically that the same results hold for few-shot image classification on CIFAR-FS and mini-ImageNet. Our results provide guidance for allocating labels across tasks when collecting data for meta-learning.

#### Optimal Order Simple Regret for Gaussian Process Bandits

Consider the sequential optimization of a continuous, possibly non-convex, and expensive to evaluate objective function f. The problem can be cast as a Gaussian Process (GP) bandit where f lives in a reproducing kernel Hilbert space (RKHS). The state of the art analysis of several learning algorithms shows a significant gap between the lower and upper bounds on the simple regret performance. When

#### Scalable Thompson Sampling using Sparse Gaussian Process Models

Thompson Sampling (TS) from Gaussian Process (GP) models is a powerful tool for the optimization of black-box functions. Although TS enjoys strong theoretical guarantees and convincing empirical performance, it incurs a large computational overhead that scales polynomially with the optimization budget. Recently, scalable TS methods based on sparse GP models have been proposed to increase the scope of TS, enabling its application to problems that are sufficiently multi-modal, noisy or combinatorial to require more than a few hundred evaluations to be solved. However, the approximation error introduced by sparse GPs invalidates all existing regret bounds. In this work, we perform a theoretical and empirical analysis of scalable TS. We provide theoretical guarantees and show that the drastic reduction in computational complexity of scalable TS can be enjoyed without loss in the regret performance over the standard TS. These conceptual claims are validated for practical implementations of scalable TS on synthetic benchmarks and as part of a real-world high-throughput molecular design task.

#### Cyclic Orthogonal Convolutions for Long-Range Integration of Features

In Convolutional Neural Networks (CNNs), information flows across a small neighbourhood of each pixel of an image, preventing long-range integration of features before reaching deep layers in the network. We propose a novel architecture that allows flexible information flow between features z and locations (x, y) across the entire image with a small number of layers. This architecture uses a cycle of three orthogonal convolutions, not only in (x, y) coordinates, but also in (x, z) and (y, z) coordinates. We stack a sequence of such cycles to obtain our deep network, named CycleNet. As this only requires a permutation of the axes of a standard convolution, its performance can be directly compared to a CNN. Our model obtains competitive results at image classification on CIFAR-10 and ImageNet datasets, when compared to CNNs of similar size. We hypothesise that long-range integration favours recognition of objects by shape rather than texture, and we show that CycleNet transfers better than CNNs to stylised images. On the Pathfinder challenge, where integration of distant features is crucial, CycleNet outperforms CNNs by a large margin. We also show that even when employing a small convolutional kernel, the size of receptive fields of CycleNet reaches its maximum after one cycle, while conventional CNNs require a large number of layers.

#### A Domain-Shrinking based Bayesian Optimization Algorithm with Order-Optimal Regret Performance

We consider sequential optimization of an unknown function in a reproducing kernel Hilbert space. We propose a Gaussian process-based algorithm and establish its order-optimal regret performance (up to a poly-logarithmic factor). This is the first GP-based algorithm with an order-optimal regret guarantee. The proposed algorithm is rooted in the methodology of domain shrinking realized through a sequence of tree-based region pruning and refining to concentrate queries in increasingly smaller high-performing regions of the function domain. The search for high-performing regions is localized and guided by an iterative estimation of the optimal function value to ensure both learning efficiency and computational efficiency. Compared with the prevailing GP-UCB family of algorithms, the proposed algorithm reduces computational complexity by a factor of

#### Natural Continual Learning: Success is a Journey, not (just) a Destination

Biological agents are known to learn many different tasks over the course of their lives, and to be able to revisit previous tasks and behaviors with little to no loss in performance. In contrast, artificial agents are prone to 'catastrophic forgetting' whereby performance on previous tasks deteriorates rapidly as new ones are acquired. This shortcoming has recently been addressed using methods that encourage parameters to stay close to those used for previous tasks. This can be done by (i) using specific parameter regularizers that map out suitable destinations in parameter space, or (ii) guiding the optimization journey by projecting gradients into subspaces that do not interfere with previous tasks. However, parameter regularization has been shown to be relatively ineffective in recurrent neural networks (RNNs), a setting relevant to the study of neural dynamics supporting biological continual learning. Similarly, projection based methods can reach capacity and fail to learn any further as the number of tasks increases. To address these limitations, we propose Natural Continual Learning (NCL), a new method that unifies weight regularization and projected gradient descent. NCL uses Bayesian weight regularization to encourage good performance on all tasks at convergence and combines this with gradient projections designed to prevent catastrophic forgetting during optimization. NCL formalizes gradient projection as a trust region algorithm based on the Fisher information metric, and achieves scalability via a novel Kronecker-factored approximation strategy. Our method outperforms both standard weight regularization techniques and projection based approaches when applied to continual learning problems in RNNs. The trained networks evolve task-specific dynamics that are strongly preserved as new tasks are learned, similar to experimental findings in biological circuits.

#### Open Problem: Tight Online Confidence Intervals for RKHS Elements

Confidence intervals are a crucial building block in the analysis of various online learning problems. The analysis of kernel-based bandit and reinforcement learning problems utilizes confidence intervals applicable to the elements of a reproducing kernel Hilbert space (RKHS). However, the existing confidence bounds do not appear to be tight, resulting in suboptimal regret bounds. In fact, the existing regret bounds for several kernelized bandit algorithms (e.g., GP-UCB, GP-TS, and their variants) may fail to even be sublinear. It is unclear whether the suboptimal regret bound is a fundamental shortcoming of these algorithms or an artifact of the proof, and the main challenge seems to stem from the online (sequential) nature of the observation points. We formalize the question of online confidence intervals in the RKHS setting and overview the existing results.

#### Meta-Learning with Negative Learning Rates

Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or "learning to learn" a distribution of tasks, where "learning" is represented by an outer loop, and "to learn" by an inner loop of gradient descent. However, a number of recent empirical studies argue that the inner loop is unnecessary and that simpler models work equally well or even better. We study the performance of MAML as a function of the learning rate of the inner loop, where zero learning rate implies that there is no inner loop. Using random matrix theory and exact solutions of linear models, we calculate an algebraic expression for the test loss of MAML applied to mixed linear regression and nonlinear regression with overparameterized models. Surprisingly, while the optimal learning rate for adaptation is positive, we find that the optimal learning rate for training is always negative, a setting that has never been considered before. Therefore, not only does the performance increase by decreasing the learning rate to zero, as suggested by recent work, but it can be increased even further by decreasing the learning rate to negative values. These results help clarify under what circumstances meta-learning performs best.

#### Gambler Bandits and the Regret of Being Ruined

In this paper we consider a particular class of problems called multiarmed gambler bandits (MAGB), which constitutes a modified version of the Bernoulli MAB problem where two new elements must be taken into account: the budget and the risk of ruin. The agent has an initial budget that evolves in time following the received rewards, which can be either +1 after a success or -1 after a failure. The problem can also be seen as a MAB version of the classic gambler's ruin game. The contribution of this paper is a preliminary analysis on the probability of being ruined given the current budget and observations, and the proposition of an alternative regret formulation, combining the classic regret notion with the expected loss due to the probability of being ruined. Finally, standard state-of-the-art methods are experimentally compared using the proposed metric.
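The budget dynamics described above can be sketched as a small simulation. This is a hedged illustration of the problem setting, not the paper's analysis or regret formulation: the policy interface, the function names, and the convention that the agent is ruined once its budget reaches zero are assumptions. The helper for a single arm uses the classical gambler's ruin result for an unbounded horizon.

```python
import random

def uniform_policy(counts, successes, rng):
    """Hypothetical baseline policy: pick an arm uniformly at random."""
    return rng.randrange(len(counts))

def simulate_gambler_bandit(success_probs, initial_budget, horizon, policy, seed=0):
    """Play Bernoulli arms with +1/-1 rewards until the horizon or until ruin.

    Returns (final_budget, rounds_played, ruined).
    """
    rng = random.Random(seed)
    counts = [0] * len(success_probs)
    successes = [0] * len(success_probs)
    budget = initial_budget
    for t in range(horizon):
        if budget <= 0:  # assumed convention: ruin when the budget hits zero
            return budget, t, True
        arm = policy(counts, successes, rng)
        success = rng.random() < success_probs[arm]
        budget += 1 if success else -1
        counts[arm] += 1
        successes[arm] += int(success)
    return budget, horizon, budget <= 0

def single_arm_ruin_probability(p, budget):
    """Classical gambler's ruin: chance of ever hitting zero when always playing one arm."""
    if p <= 0.5:
        return 1.0
    return ((1.0 - p) / p) ** budget

result = simulate_gambler_bandit([0.4, 0.6], initial_budget=5, horizon=1000,
                                 policy=uniform_policy)
```

Swapping `uniform_policy` for a standard bandit strategy gives a way to compare how often different methods are ruined before they have learned which arm is best.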

#### The Stereoscopic Analog Trigger of the MAGIC Telescopes

The current generation of ground-based imaging atmospheric Cherenkov telescopes (IACTs) operate in the very-high-energy (VHE) domain from ~100 GeV to ~100 TeV. They use electronic digital trigger systems to discern the Cherenkov light flashes emitted by extensive air showers (EASs), from the overwhelming light of the night sky (LoNS) background. Near the telescope energy threshold, the number of emitted Cherenkov photons by gamma-ray-induced EASs is comparable to the fluctuations of the LoNS and the photon distribution at the Cherenkov-imaging camera plane becomes patchy. This results in a severe loss of effectiveness of the digital triggers based on combinatorial logic of thresholded signals. A stereoscopic analog trigger system has been developed for improving the detection capabilities of the Major Atmospheric Gamma-ray Imaging Cherenkov (MAGIC) telescopes at the lowest energies. It is based on the analog sum of the photosensor electrical signals. In this article, the architectural design, technical performances, and configuration of this stereoscopic analog trigger, dubbed “Sum-Trigger-II,” are described.

#### Cross-Lingual Transfer with MAML on Trees

In meta-learning, the knowledge learned from previous tasks is transferred to new ones, but this transfer only works if tasks are related. Sharing information between unrelated tasks might hurt performance, and it is unclear how to transfer knowledge across tasks that have a hierarchical structure. Our research extends a meta-learning model, MAML, by exploiting hierarchical task relationships. Our algorithm, TreeMAML, adapts the model to each task with a few gradient steps, but the adaptation follows the hierarchical tree structure: in each step, gradients are pooled across task clusters and subsequent steps follow down the tree. We also implement a clustering algorithm that generates the task tree without previous knowledge of the task structure, allowing us to make use of implicit relationships between the tasks. We show that TreeMAML successfully trains natural language processing models for cross-lingual Natural Language Inference by taking advantage of the language phylogenetic tree. This result is useful since most languages in the world are under-resourced and the improvement on cross-lingual transfer allows the internationalization of NLP models.

#### Non-reversible Gaussian processes for identifying latent dynamical structure in neural data

A common goal in the analysis of neural data is to compress large population recordings into sets of interpretable, low-dimensional latent trajectories. This problem can be approached using Gaussian process (GP)-based methods which provide uncertainty quantification and principled model selection. However, standard GP priors do not distinguish between underlying dynamical processes and other forms of temporal autocorrelation. Here, we propose a new family of “dynamical” priors over trajectories, in the form of GP covariance functions that express a property shared by most dynamical systems: temporal non-reversibility. Non-reversibility is a universal signature of autonomous dynamical systems whose state trajectories follow consistent flow fields, such that any observed trajectory could not occur in reverse. Our new multi-output GP kernels can be used as drop-in replacements for standard kernels in multivariate regression, but also in latent variable models such as Gaussian process factor analysis (GPFA). We therefore introduce GPFADS (Gaussian Process Factor Analysis with Dynamical Structure), which models single-trial neural population activity using low-dimensional, non-reversible latent processes. Unlike previously proposed non-reversible multi-output kernels, ours admits a Kronecker factorization enabling fast and memory-efficient learning and inference. We apply GPFADS to synthetic data and show that it correctly recovers ground truth phase portraits. GPFADS also provides a probabilistic generalization of jPCA, a method originally developed for identifying latent rotational dynamics in neural data. When applied to monkey M1 neural recordings, GPFADS discovers latent trajectories with strong dynamical structure in the form of rotations.

### Tek Talk

### Seminars

### Latest Updates

### Field of Expertise

#### Generative Models

#### Artificial Intelligence

#### Wireless Communication

#### Chip Placement

### Online Lectures

### You might also be interested in

## MediaTek AI Processing Unit (APU)

MediaTek develops its own Deep Learning Accelerators (Performance Cores), Visual Processing Units (Flexible Cores), a hardware-based multicore scheduler, and software development kits (NeuroPilot) that make up the core components of its industry-leading AI Processing Units (APUs).

## MediaTek NeuroPilot

We're meeting the Edge AI challenge head-on with MediaTek NeuroPilot. Through the heterogeneous computing capabilities in our SoCs, such as APUs, GPUs, and CPUs, we provide high performance and power efficiency for AI features and applications. Developers can target these specific processing units within the chip, or they can let the MediaTek NeuroPilot SDK intelligently handle the processing allocation for them.

Interested in knowing more about MediaTek Research?

Please feel free to contact us:

info@mtkresearch.com