DAT_2_X2

Official paper pretrain model

SRFormer: Permuted Self-Attention for Single Image Super-Resolution

Yupeng Zhou¹, Zhen Li¹, Chun-Le Guo¹, Song Bai², Ming-Ming Cheng¹, Qibin Hou¹

¹ TMCC, School of Computer Science, Nankai University

² ByteDance, Singapore

arXiv

The official PyTorch implementation of SRFormer: Permuted Self-Attention for Single Image Super-Resolution (arxiv). SRFormer achieves state-of-the-art performance in

    classical image SR
    lightweight image SR
    real-world image SR

The results can be found here.

    Abstract: In this paper, we introduce SRFormer, a simple yet effective Transformer-based model for single image super-resolution. We rethink the design of the popular shifted window self-attention, expose and analyze several characteristic issues of it, and present permuted self-attention (PSA). PSA strikes an appropriate balance between the channel and spatial information for self-attention, allowing each Transformer block to build pairwise correlations within large windows with even less computational burden. Our permuted self-attention is simple and can be easily applied to existing super-resolution networks based on Transformers. Without any bells and whistles, we show that our SRFormer achieves a 33.86dB PSNR score on the Urban100 dataset, which is 0.46dB higher than that of SwinIR but uses fewer parameters and computations. We hope our simple and effective approach can serve as a useful tool for future research in super-resolution model design. Our code is publicly available at https://github.com/HVision-NKU/SRFormer.
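
The abstract reports quality in PSNR (dB). As a rough illustration of what that number measures, here is a minimal sketch of the standard PSNR formula on flat pixel sequences. It assumes 8-bit pixel values (peak 255) and is not the evaluation code from the SRFormer repository; SR benchmarks commonly apply extra steps (for example, evaluating on a luminance channel and cropping borders) that are omitted here. The function names are illustrative only.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, the 0.46dB gain quoted above corresponds to `mse_ratio_for_gain(0.46)`, that is, roughly 10% lower mean squared error.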

Contents

    Installation & Dataset
    Training
    Testing
    Results
    Pretrain Models
    Citations
    License and Acknowledgement

Installation & Dataset

    Python 3.8
    PyTorch >= 1.7.0

cd SRFormer
pip install -r requirements.txt
python setup.py develop

Dataset

We use the same training and testing sets as SwinIR. The following datasets need to be downloaded for training.

| Task | Training Set | Testing Set |
| --- | --- | --- |
| classical image SR | DIV2K (800 training images) or DIV2K + Flickr2K (2650 images) | Set5 + Set14 + BSD100 + Urban100 + Manga109 |
| lightweight image SR | DIV2K (800 training images) | Set5 + Set14 + BSD100 + Urban100 + Manga109 |
| real-world image SR | DIV2K (800 training images) + Flickr2K (2650 images) + OST (10324 images for sky, water, grass, mountain, building, plant, animal) | RealSRSet+5images |

Training

    Please download the datasets corresponding to the task and place them in the folder specified by the training options in /options/train/SRFormer.
    Follow the instructions below to train our SRFormer.

# train SRFormer for classical SR task
./scripts/dist_train.sh 4 options/train/SRFormer/train_SRFormer_SRx2_scratch.yml
./scripts/dist_train.sh 4 options/train/SRFormer/train_SRFormer_SRx3_scratch.yml
./scripts/dist_train.sh 4 options/train/SRFormer/train_SRFormer_SRx4_scratch.yml
# train SRFormer for lightweight SR task
./scripts/dist_train.sh 4 options/train/SRFormer/train_SRFormer_light_SRx2_scratch.yml
./scripts/dist_train.sh 4 options/train/SRFormer/train_SRFormer_light_SRx3_scratch.yml
./scripts/dist_train.sh 4 options/train/SRFormer/train_SRFormer_light_SRx4_scratch.yml

Testing

# test SRFormer for classical SR task
python basicsr/test.py -opt options/test/SRFormer/test_SRFormer_DF2Ksrx2.yml
python basicsr/test.py -opt options/test/SRFormer/test_SRFormer_DF2Ksrx3.yml
python basicsr/test.py -opt options/test/SRFormer/test_SRFormer_DF2Ksrx4.yml
# test SRFormer for lightweight SR task
python basicsr/test.py -opt options/test/SRFormer/test_SRFormer_light_DIV2Ksrx2.yml
python basicsr/test.py -opt options/test/SRFormer/test_SRFormer_light_DIV2Ksrx3.yml
python basicsr/test.py -opt options/test/SRFormer/test_SRFormer_light_DIV2Ksrx4.yml

Results

We provide results for classical image SR, lightweight image SR, and real-world image SR. More results can be found in the paper. The visual results of SRFormer will be uploaded to Google Drive soon.

Classical image SR

    Results of Table 4 in the paper

    Results of Figure 4 in the paper

Lightweight image SR

    Results of Table 5 in the paper

    Results of Figure 5 in the paper

Model size comparison

    Results of Table 1 and Table 2 in the Supplementary Material

Real-world image SR

    Results of Figure 8 in the paper

Pretrain Models

Pretrained models can be downloaded from Google Drive. To reproduce the results in the paper, download them and put them in the /PretrainModel folder.

Citations

You may want to cite:

@article{zhou2023srformer,
  title={SRFormer: Permuted Self-Attention for Single Image Super-Resolution},
  author={Zhou, Yupeng and Li, Zhen and Guo, Chun-Le and Bai, Song and Cheng, Ming-Ming and Hou, Qibin},
  journal={arXiv preprint arXiv:2303.09735},
  year={2023}
}

License and Acknowledgement

This project is released under the Apache 2.0 license. The code is based on BasicSR, Swin Transformer, and SwinIR. Please also follow their licenses. Thanks for their awesome work.

SRFormerLight_SRx2_DIV2K

[Swift-SRGAN - Rethinking Super-Resolution for real-time inference](https://github.com/Koushik0901/Swift-SRGAN)

In recent years, there have been several advancements in the task of image super-resolution using state-of-the-art Deep Learning-based architectures. Many previously published super-resolution techniques require high-end, top-of-the-line Graphics Processing Units (GPUs) to perform image super-resolution. With the increasing advancements in Deep Learning approaches, neural networks have become more and more compute hungry. We took a step back and focused on creating a real-time efficient solution. We present an architecture that is faster and smaller in terms of its memory footprint. The proposed architecture uses Depth-wise Separable Convolutions to extract features, and it performs on par with other super-resolution GANs (Generative Adversarial Networks) while maintaining real-time inference and a low memory footprint. Real-time super-resolution enables streaming high-resolution media content even under poor bandwidth conditions. While maintaining an efficient trade-off between accuracy and latency, we are able to produce a model with comparable performance that is one-eighth (1/8) the size of super-resolution GANs and computes 74 times faster than super-resolution GANs.
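
To make the memory argument for depth-wise separable convolutions concrete, the sketch below compares weight counts for a single layer. The channel width (64) and kernel size (3) are assumptions chosen for illustration, not values taken from the Swift-SRGAN code, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer: 64 -> 64 channels with a 3 x 3 kernel (assumed values).
std = standard_conv_params(64, 64, 3)        # 36864 weights
sep = depthwise_separable_params(64, 64, 3)  # 4672 weights
ratio = std / sep                            # about 7.89x fewer weights
```

This counts only the weights of one layer; the size and speed figures quoted in the abstract refer to the whole model and depend on the rest of the architecture.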

NOTE: The author used the wrong file extensions for the models *on GitHub*. You will download a `.pth.tar` file. This is not actually a TAR file. Change the file extension to just `.pth` and the model will work.
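
The rename described in the note can be scripted. This is a minimal sketch using only the standard library; the helper name `fix_checkpoint_extension` and the example filename are hypothetical.

```python
from pathlib import Path


def fix_checkpoint_extension(path):
    """Rename 'name.pth.tar' to 'name.pth'; return other paths unchanged."""
    path = Path(path)
    if not path.name.endswith(".pth.tar"):
        return path
    # Drop only the trailing '.tar' so the file keeps its '.pth' suffix.
    target = path.with_name(path.name[: -len(".tar")])
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    path.rename(target)
    return target
```

For example, `fix_checkpoint_extension("downloads/model_2x.pth.tar")` would leave the file at `downloads/model_2x.pth` (the path here is a placeholder).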

Swift-SRGAN 2x

Official 2x pretrain for [SPAN](https://github.com/hongyuanyu/span).

spanx2_ch48

DAT_2_X3

Official 4x finetune pretrain from [ATD](https://github.com/LabShuHangGU/Adaptive-Token-Dictionary)

# Adaptive Token Dictionary

This repository is an official implementation of the paper "Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary", CVPR, 2024.

[[Arxiv](https://arxiv.org/abs/2401.08209)] [[visual results](https://drive.google.com/drive/folders/1HwEbAGU6WEw9ZGbFdt__BOJo_5DKflEb?usp=sharing)] [[pretrained models](https://drive.google.com/drive/folders/1D3BvTS1xBcaU1mp50k3pBzUWb7qjRvmB?usp=sharing)]

By Leheng Zhang, Yawei Li, Xingyu Zhou, Xiaorui Zhao, and Shuhang Gu.

> **Abstract:** Single Image Super-Resolution is a classic computer vision problem that involves estimating high-resolution (HR) images from low-resolution (LR) ones. Although deep neural networks (DNNs), especially Transformers for super-resolution, have seen significant advancements in recent years, challenges still remain, particularly in limited receptive field caused by window-based self-attention. To address these issues, we introduce a group of auxiliary **A**daptive **T**oken **D**ictionary to SR Transformer and establish an **ATD**-SR method. The introduced token dictionary could learn prior information from training data and adapt the learned prior to specific testing image through an adaptive refinement step. The refinement strategy could not only provide global information to all input tokens but also group image tokens into categories. Based on category partitions, we further propose a category-based self-attention mechanism designed to leverage distant but similar tokens for enhancing input features. The experimental results show that our method achieves the best performance on various single image super-resolution benchmarks.
> 
> <img width="800" src="figures/tdca-acmsa.png"> 
> <img width="800" src="figures/arch.png"> 
> <br/><br/>
> <img width="800" src="figures/viscomp_076.png">

003_ATD_SRx4_finetune

Official 4x Pretrain of Dual Aggregation Transformer for Image Super-Resolution DAT

Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods.

DAT_x4

Official DAT2 4x Pretrain of Dual Aggregation Transformer for Image Super-Resolution DAT

Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods.

DAT_2_x4

Official 4x drct-l Pretrain model, as released on their [DRCT Github Page](https://github.com/ming053l/DRCT?tab=readme-ov-file)

DRCT-L_X4

[ESRGAN](https://github.com/xinntao/ESRGAN)

Pretrained: RRDB_ESRGAN_x4.pth

4xESRGAN

Official Paper pretrain model.

See https://github.com/XPixelGroup/HAT for more details.

HAT-L_SRx4_ImageNet-pretrain

Official Paper model.

See https://github.com/XPixelGroup/HAT for more details.

HAT-S_SRx4

[Omni Aggregation Networks for Lightweight Image Super-Resolution (OmniSR)](https://github.com/Francis0625/Omni-SR)

While lightweight ViT framework has made tremendous progress in image super-resolution, its uni-dimensional self-attention modeling, as well as homogeneous aggregation scheme, limit its effective receptive field (ERF) to include more comprehensive interactions from both spatial and channel dimensions. To tackle these drawbacks, this work proposes two enhanced components under a new Omni-SR architecture. First, an Omni Self-Attention (OSA) block is proposed based on dense interaction principle, which can simultaneously model pixel-interaction from both spatial and channel dimensions, mining the potential correlations across omni-axis (i.e., spatial and channel). Coupling with mainstream window partitioning strategies, OSA can achieve superior performance with compelling computational budgets. Second, a multi-scale interaction scheme is proposed to mitigate sub-optimal ERF (i.e., premature saturation) in shallow models, which facilitates local propagation and meso-/global-scale interactions, rendering an omni-scale aggregation building block. Extensive experiments demonstrate that Omni-SR achieves record-high performance on lightweight super-resolution benchmarks (e.g., 26.95dB@Urban100 ×4 with only 792K parameters). Our code is available at https://github.com/Francis0625/Omni-SR

OmniSR 4x DF2K

OmniSR 4x DIV2K

RGT x4

Official 4x rgt-s pretrain from [RGT](https://github.com/zhengchen1999/RGT)

# Recursive Generalization Transformer for Image Super-Resolution

[Zheng Chen](https://zhengchen1999.github.io/), [Yulun Zhang](http://yulunzhang.com/), [Jinjin Gu](https://www.jasongt.com/), [Linghe Kong](https://www.cs.sjtu.edu.cn/~linghe.kong/), and [Xiaokang Yang](https://scholar.google.com/citations?user=yDEavdMAAAAJ), "Recursive Generalization Transformer for Image Super-Resolution", ICLR, 2024

[[paper](https://openreview.net/pdf?id=owziuM1nsR)] [[arXiv](https://arxiv.org/abs/2303.06373)] [[supplementary material](https://openreview.net/attachment?id=owziuM1nsR&name=supplementary_material)] [[visual results](https://drive.google.com/drive/folders/1TWIl66LPtojEbnlUr-s7qkUuTd7RF7Hp?usp=sharing)] [[pretrained models](https://drive.google.com/drive/folders/1UNn5LvnfQAi6eHAHz-mTYWu8vCJs5kwu?usp=sharing)]

---

> **Abstract:** Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Because of the quadratic computational complexity of the self-attention (SA) in Transformer, existing methods tend to adopt SA in a local region to reduce overheads. However, the local design restricts the global context exploitation, which is crucial for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose the recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps, and then utilizes cross-attention to extract global information. Meanwhile, the channel dimensions of attention matrices ($query$, $key$, and $value$) are further scaled to mitigate the redundancy in the channel domain. Furthermore, we combine the RG-SA with local self-attention to enhance the exploitation of the global context, and propose the hybrid adaptive integration (HAI) for module integration. The HAI allows the direct and effective fusion between features at different levels (local or global). Extensive experiments demonstrate that our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively.
---

RGT_S

[Structure-Preserving Super Resolution with Gradient Guidance](https://github.com/Maclory/SPSR)

Structures matter in single image super resolution (SISR). Recent studies benefiting from generative adversarial network (GAN) have promoted the development of SISR by recovering photo-realistic images. However, there are always undesired structural distortions in the recovered images. In this paper, we propose a structure-preserving super resolution method to alleviate the above issue while maintaining the merits of GAN-based methods to generate perceptual-pleasant details. Specifically, we exploit gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss which imposes a second-order restriction on the super-resolved images. Along with the previous image-space loss functions, the gradient-space objectives help generative networks concentrate more on geometric structures. Moreover, our method is model-agnostic, which can be potentially used for off-the-shelf SR networks. Experimental results show that we achieve the best PI and LPIPS performance and meanwhile comparable PSNR and SSIM compared with state-of-the-art perceptual-driven SR methods. Visual results demonstrate our superiority in restoring structures while generating natural SR images.

4x SPSR

SRFormer_SRx4_DF2K

Swift-SRGAN 4x

Purpose: Pretrain for quick-starting new models.

craft pretrain

Model trained only on downscale degradations (bicubic, bilinear, nearest, Lanczos, and Mitchell). It can be used as a starting point for new Real-CUGAN models. An ONNX version is included in the directory.

cugan_pretrain

[Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) aims at developing Practical Algorithms for General Image/Video Restoration.

We extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data.

RealESRGAN_x4Plus

Real-World Super-Resolution via Kernel Estimation and Noise Injection

Our solution is the winner of CVPR NTIRE 2020 Challenge on Real-World Super-Resolution in both tracks.

Recent state-of-the-art super-resolution methods have achieved impressive performance on ideal datasets regardless of blur and noise. However, these methods always fail in real-world image super-resolution, since most of them adopt simple bicubic downsampling from high-quality images to construct Low-Resolution (LR) and High-Resolution (HR) pairs for training which may lose track of frequency-related details. To address this issue, we focus on designing a novel degradation framework for real-world images by estimating various blur kernels as well as real noise distributions. Based on our novel degradation framework, we can acquire LR images sharing a common domain with real-world images. Then, we propose a real-world super-resolution model aiming at better perception. Extensive experiments on synthetic noise data and real-world images demonstrate that our method outperforms the state-of-the-art methods, resulting in lower noise and better visual quality. In addition, our method is the winner of NTIRE 2020 Challenge on both tracks of Real-World Super-Resolution, which significantly outperforms other competitors by large margins.

For corrupted images with processing noise.

RealSR DF2K

Real-World Super-Resolution via Kernel Estimation and Noise Injection

Our solution is the winner of CVPR NTIRE 2020 Challenge on Real-World Super-Resolution in both tracks.

Recent state-of-the-art super-resolution methods have achieved impressive performance on ideal datasets regardless of blur and noise. However, these methods always fail in real-world image super-resolution, since most of them adopt simple bicubic downsampling from high-quality images to construct Low-Resolution (LR) and High-Resolution (HR) pairs for training which may lose track of frequency-related details. To address this issue, we focus on designing a novel degradation framework for real-world images by estimating various blur kernels as well as real noise distributions. Based on our novel degradation framework, we can acquire LR images sharing a common domain with real-world images. Then, we propose a real-world super-resolution model aiming at better perception. Extensive experiments on synthetic noise data and real-world images demonstrate that our method outperforms the state-of-the-art methods, resulting in lower noise and better visual quality. In addition, our method is the winner of NTIRE 2020 Challenge on both tracks of Real-World Super-Resolution, which significantly outperforms other competitors by large margins.

For real images taken by a cell phone camera.

RealSR DPED

Official 4x pretrain for [SPAN](https://github.com/hongyuanyu/span).

Architecture	Swift-SRGAN
Scale	4x
Size	64nf16nb
Color Mode	RGB
License	CC0-1.0
Date	2021-12-31
Dataset	DIV2K + Flickr2K
Dataset size	3669

Swift-SRGAN 4x
