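Linux Kernel Security Mitigation Mechanisms

Preface

In an interview I was asked about Linux kernel protection mechanisms and found myself at a loss for words, so I set out to learn about them. This post draws mainly on reference 1.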

LSM

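LSM stands for Linux Security Modules. It is a lightweight security access-control framework that was originally proposed on top of the kernel module mechanism.

The LSM framework only provides an interface for security modules; it does not improve system security by itself. In the original design, security modules could be loaded into and unloaded from the kernel freely, without recompiling it. In current kernels they are built in and selected at boot time.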

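Framework structure

  • Security fields (holding security attributes) are added to kernel data structures
  • Calls to hook functions are inserted at key points in the kernel code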
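
The two points above can be pictured with a toy model. This is only a sketch in Python, not kernel code: `Inode`, `register_hook`, `inode_permission`, and the example policy are hypothetical names chosen for illustration, and the real kernel machinery is considerably more involved.

```python
# Toy model of the LSM idea: objects carry a security field, and the
# "kernel" calls registered hook functions at key decision points.
# All names here are illustrative, not the real kernel API.

class Inode:
    def __init__(self, name, security=None):
        self.name = name
        self.security = security  # security field filled in by a module

hooks = []  # hook functions registered by security modules

def register_hook(fn):
    hooks.append(fn)

def inode_permission(inode, mask):
    """Return 0 if every registered hook allows the access, else -1."""
    for hook in hooks:
        if hook(inode, mask) != 0:
            return -1
    return 0

def deny_secret_writes(inode, mask):
    # Example policy: objects labelled "secret" are never writable.
    if inode.security == "secret" and "w" in mask:
        return -1
    return 0

register_hook(deny_secret_writes)
```

With no hook registered every access is allowed, which matches the point that the framework by itself adds no security.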

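Access control models

  • DAC: Discretionary Access Control. A subject that owns (or controls) an object can, at its own discretion, grant one or more access rights on that object to other subjects and revoke them at any later time.
  • MAC: Mandatory Access Control. Its goal is to restrict the ability of a subject (the initiator) to access or perform an operation on an object (the target). The subject is usually a process or thread; the object may be a file, a directory, a TCP/UDP port, a shared memory segment, an I/O device, and so on. Whenever a subject tries to access an object, the operating system kernel enforces the authorization rules, examining the security attributes to decide whether the access may proceed. Every operation by any subject on any object is tested against the same set of rules. In this model users cannot override or modify the policy; it is controlled centrally by the security administrator.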
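
A small sketch may make the contrast concrete. It assumes a made-up policy format: the labels `web_t` and `web_content_t`, the `File` class, and the helper names are hypothetical, and real systems evaluate far richer rules.

```python
# Toy contrast between DAC and MAC. Names and policy format are illustrative.

class File:
    def __init__(self, owner, label):
        self.owner = owner
        self.label = label
        self.acl = {owner: {"r", "w"}}  # DAC: per-object permissions

def dac_grant(file, actor, user, perms):
    """DAC: only the owner may hand out (or later change) access."""
    if actor != file.owner:
        raise PermissionError("only the owner can grant access")
    file.acl[user] = set(perms)

# MAC: a central policy maps (subject label, object label) to allowed ops.
# Ordinary users cannot change this table.
MAC_POLICY = {("web_t", "web_content_t"): {"r"}}

def access_allowed(user, subject_label, file, op):
    dac_ok = op in file.acl.get(user, set())
    mac_ok = op in MAC_POLICY.get((subject_label, file.label), set())
    return dac_ok and mac_ok  # both checks must pass
```

Here the owner can grant another user write access under DAC, yet the write is still refused because the central MAC policy only permits reads.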

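Characteristics

An LSM differs from an ordinary Linux kernel module in at least two ways:

  • It must be enabled when the bootloader starts the kernel; it cannot be started after the kernel has finished loading.
  • Two LSM implementations cannot be enabled at the same time (this applies to the major LSMs; newer kernels allow some modules to be stacked).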

Implementations

SELinux

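Originally developed by the US NSA. It is built on mandatory access control (MAC) together with role-based access control: a process can access only the files its task requires. This simplifies user permission management and reduces system overhead.

Advantages of SELinux

Because the privileges of users and processes are minimized, an attack that takes over a process or a user account does not have a major impact on the whole system.

  • Thorough access control (MAC)
  • Child processes receive only the minimum privileges
  • Prevents privilege escalation
  • Users receive only the minimum privileges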

AppArmor

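Maintained by openSUSE/Ubuntu and also based on MAC, AppArmor is an alternative to SELinux. SELinux labels files, whereas AppArmor works on file paths, which makes it simpler to configure and requires fewer changes to the system.

PaX/grsecurity

Started by the PaX team and later maintained by the grsecurity team, it is added to Linux in the form of a patch.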

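Linux defense mechanisms

Map of Linux defense mechanisms (figure taken from reference 1).

Legend:

Green: defenses merged into the mainline kernel

Blue: out-of-tree defenses

Gray: commercial defenses

Cyan: hardware (HW) defenses

White: generic defense techniques

Purple: bug detection

Pink: vulnerabilities

Yellow: exploitation techniques

Defense mechanisms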

RANDSTRUCT

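At compile time, the plugin randomly rearranges the fields of a structure according to a supplied seed. Because an attacker no longer knows the layout, overwriting a specific field becomes much harder. (My translation is amateurish, so here is the original text, as in the entries below: The randstruct plugin randomly rearranges fields at compile time given a randomization seed. When potential attackers do not know the layout of a structure, it becomes much harder for them to overwrite specific fields in those structures.)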
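
The idea of a seeded, reproducible shuffle can be sketched as follows. This is only an illustration in Python: the field names and the `randomized_layout` helper are made up, and the real plugin operates on C structures inside the compiler.

```python
import random

def randomized_layout(fields, seed):
    """Return the field order that a seeded shuffle would produce."""
    layout = list(fields)
    random.Random(seed).shuffle(layout)
    return layout

# Hypothetical field names, loosely modelled on a credentials structure.
FIELDS = ["uid", "gid", "euid", "egid", "security", "user"]

# The same seed gives the same layout, so one build stays self-consistent,
# while an attacker who does not know the seed cannot predict the offsets.
build_a = randomized_layout(FIELDS, seed=1234)
build_b = randomized_layout(FIELDS, seed=1234)
```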

RANDOMIZE_{BASE,MEMORY}

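The switches for KASLR (kernel address space layout randomization). They also depend on the kernel command line (for example, nokaslr turns it off).

RANDOMIZE_BASE: maps the kernel image at a different virtual address on every boot.

RANDOMIZE_MEMORY: when enabled, randomizes the base addresses of the linear mapping, the vmalloc area, and the vmemmap area.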

LATENT_ENTROPY

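Mixes random values into the global variable latent_entropy and adds that value to the kernel entropy pool, increasing the kernel's entropy and therefore the unpredictability of keys generated in the kernel. (This plugin mixes random values into the latent_entropy global variable in functions marked by the __latent_entropy attribute. The value of this global variable is added to the kernel entropy pool to increase the entropy.)

__ro_after_init

Function pointers and sensitive variables should not be writable. Variables that are initialized only during init can be marked with the __ro_after_init attribute, which makes them read-only once initialization has finished.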

REFCOUNT_FULL

Adds overflow protection to refcount_t by duplicating the existing atomic_t refcount implementation and adding, normally, a single instruction that detects when the counter has gone negative (for example after wrapping past INT_MAX or dropping below zero). When this is detected, the counter is saturated to INT_MIN / 2. (This is done by duplicating the existing atomic_t refcount implementation but with normally a single instruction added to detect if the refcount has gone negative (e.g. wrapped past INT_MAX or below zero). When detected, the handler saturates the refcount_t to INT_MIN / 2. With this overflow protection, the erroneous reference release that would follow a wrap back to zero is blocked from happening, avoiding the class of refcount-overflow use-after-free vulnerabilities entirely.)
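
The saturation behavior can be modelled in a few lines. This is a simplified sketch, not the kernel implementation: the helper names are invented for illustration, Python integers are wrapped to 32 bits by hand, and the saturation value INT_MIN / 2 follows the quoted description.

```python
INT_MAX = 2**31 - 1
INT_MIN = -2**31
REFCOUNT_SATURATED = INT_MIN // 2

def to_int32(value):
    """Wrap a Python int to a signed 32-bit value, like a C int would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT_MAX else value

def refcount_inc(count):
    """Increment; pin the counter at the saturation value on overflow."""
    if count < 0:                # already saturated: stay pinned
        return REFCOUNT_SATURATED
    new = to_int32(count + 1)
    if new < 0:                  # wrapped past INT_MAX
        return REFCOUNT_SATURATED
    return new

def refcount_dec(count):
    """Decrement; a saturated or underflowing counter stays pinned."""
    if count <= 0:
        return REFCOUNT_SATURATED
    return count - 1
```

Once pinned, the counter can no longer reach zero, so the object is leaked rather than freed while still in use.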

TIF_FSCHECK flag

Adds a check of addr_limit before returning to user mode. Any call to set_fs() sets the thread flag TIF_FSCHECK; if that flag is seen on the return to user space, the kernel verifies that addr_limit has not been left elevated. (Check address limit on user-mode return, added a mechanism to check the addr_limit value before returning to userspace. Any call to set_fs() sets a thread flag, TIF_FSCHECK, and if we see that on the return to userspace we go out of line to check that the addr_limit value is not elevated.)

bpf_jit_harden

Hardens the BPF just-in-time compiler. It is supported by the eBPF JIT backends and can mitigate JIT spraying, at the cost of some performance. (This enables hardening for the BPF JIT compiler. Supported are eBPF JIT backends. Enabling hardening trades off performance, but can mitigate JIT spraying.)

MODULE_SIG*

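Checks the validity of a module's signature when the module is loaded. When signatures are enforced, a module whose signature is missing or does not match is refused. (Check modules for valid signatures upon load: the signature is simply appended to the module.)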
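
The "signature appended to the module" layout can be illustrated with a toy format. This sketch makes strong simplifying assumptions: an HMAC-SHA256 tag stands in for the kernel's real public-key signature, the extra metadata the kernel stores alongside the signature is omitted, and `sign_module` and `verify_module` are hypothetical helper names.

```python
import hashlib
import hmac

# Trailer string that marks a signed module file.
MARKER = b"~Module signature appended~\n"
TAG_LEN = 32  # length of an HMAC-SHA256 tag (stand-in for a real signature)

def sign_module(payload, key):
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag + MARKER

def verify_module(blob, key):
    """Return the payload if the appended tag checks out, else None."""
    if not blob.endswith(MARKER):
        return None                       # no signature present
    body = blob[:-len(MARKER)]
    payload, tag = body[:-TAG_LEN], body[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None
```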

SECURITY_LOADPIN

Checks whether loaded kernel modules, firmware, security policies, and similar files come from a trusted source; anything from an untrusted source is not loaded. (That is the policy that LoadPin was created to implement. It takes advantage of the relatively new kernel file-loading mechanism to intercept all attempts to load a file into the kernel; these include loading kernel modules, reading firmware, loading a security policy, or loading an image for kexec().)

LDISC_AUTOLOAD

Controls whether the kernel automatically loads line-discipline modules on request. Setting it to N stops unprivileged users from triggering the load of different, "ancient" line disciplines; setting it to Y keeps the traditional on-demand loading. (Historically the kernel has always automatically loaded any line discipline that is in a kernel module when a user asks for it to be loaded with the TIOCSETD ioctl, or through other means. This is not always the best thing to do on systems where you know you will not be using some of the more “ancient” line disciplines, so prevent the kernel from doing this unless the request is coming from a process with the CAP_SYS_MODULE permissions.

Say ‘Y’ here if you trust your userspace users to do the right thing, or if you have only provided the line disciplines that you know you will be using, or if you wish to continue to use the traditional method of on-demand loading of these modules by any user.)

STRICT_{KERNEL,MODULE}_RWX

Similar to NX protection in user space: kernel text and rodata are made read-only, and non-text memory is made non-executable. (If this is set, kernel text and rodata memory will be made read-only, and non-text memory will be made non-executable. This provides protection against certain security exploits (e.g. executing the heap or modifying text))

DEBUG_WX

Emits a warning if any mapping that is both writable and executable is found at boot. (Generate a warning if any W+X mappings are found at boot.)
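
The W+X check amounts to scanning the mappings for entries that are writable and executable at once. A minimal sketch, assuming a made-up list of mappings (the addresses, permission strings, and the `find_wx_mappings` helper are illustrative only):

```python
def find_wx_mappings(mappings):
    """Return the mappings that are both writable and executable."""
    return [m for m in mappings if "w" in m[2] and "x" in m[2]]

# (start, end, permissions, description): made-up example values
MAPPINGS = [
    (0x1000, 0x2000, "r-x", "text"),
    (0x2000, 0x3000, "r--", "rodata"),
    (0x3000, 0x4000, "rw-", "data"),
    (0x4000, 0x5000, "rwx", "suspicious"),
]
```

With strict RWX permissions in place, the scan should come back empty; any hit is worth a warning.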

ARM:RODATA_FULL_DEFAULT_ENABLED

Maps the leaf page-table entries of the linear mapping read-only; when they do need to be modified, a temporary alternative mapping is created to perform the update. (It’s fairly rare that linear mappings need to be updated, so to improve security we can map the leaf page table entries as read-only, this makes it harder for an attacker to modify the permissions of the linear mappings, while the overhead is low because the linear mappings don’t need to be changed frequently. When they do need to be updated we can use fixmaps to create a temporary alternative mapping to do the update.)

smep/PXN

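Called SMEP on x86 and PXN on ARM. Prevents the CPU, while running in kernel mode, from executing code located in user space.

smap/PAN

Called SMAP on x86 and PAN on ARM. Prevents the CPU, while running in kernel mode, from accessing data located in user space.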
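
The difference between the two features is easy to mix up: SMEP/PXN is about executing user-space code, SMAP/PAN is about reading or writing user-space data. A toy decision function may help; the address split and the `kernel_access_allowed` helper are illustrative assumptions, and the real hardware also lets the kernel lift the data restriction temporarily for legitimate user copies.

```python
USER_SPACE_END = 0x0000_8000_0000_0000  # illustrative user/kernel split

def kernel_access_allowed(addr, kind, smep=True, smap=True):
    """Decide whether kernel-mode code may touch `addr`.

    kind is 'exec', 'read' or 'write'.
    """
    if addr >= USER_SPACE_END:
        return True          # kernel addresses are not affected
    if kind == "exec":
        return not smep      # SMEP/PXN: no executing user-space code
    return not smap          # SMAP/PAN: no reading/writing user-space data
```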

DEFAULT_MMAP_MIN_ADDR

Sets the lowest virtual address a process may map, which guards against the misuse of NULL pointer dereferences. (mmap_min_addr is a kernel tunable that specifies the minimum virtual address that a process is allowed to mmap.)

pti/kpti

Kernel page-table isolation: user mode and kernel mode use separate sets of page tables. (split the page tables, which are currently shared between user and kernel space, into two sets of tables, one for each side.)

X86:MICROCODE

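x86 CPUs decode instructions into microcode/micro-ops, so some CPU hardware errors can be fixed by updating the microcode. This option enables the kernel's microcode loading support.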

X86:RETPOLINE

Retpoline is a technique developed by Google to mitigate exploitation of Spectre variant 2. Whenever the CPU is about to execute an indirect branch, it consults the indirect branch predictor and speculatively follows the most likely path; retpoline bypasses the indirect branch predictor. The name combines "return" and "trampoline": an indirect jump is replaced by a trampoline that performs the transfer with a ret instruction. (As far as I can piece this together from the limited information at the moment, a retpoline is a return trampoline that uses an infinite loop that is never executed to prevent the CPU from speculating on the target of an indirect jump.)

ARM:HARDEN_BRANCH_PREDICTOR

Hardens the branch predictor; like retpoline, it is a mitigation for Spectre. (This config option will take CPU-specific actions to harden the branch predictor against aliasing attacks and may rely on specific instruction sequences or control bits being set by the system firmware.)

spec_store_bypass_disable/ssbd

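Modern processors normally use an optimization called Speculative Store Bypass, in which a load may execute speculatively before the addresses of earlier stores are known. This can leak information through side channels during speculative execution. The spec_store_bypass_disable option can turn the optimization off system-wide, and a specific thread can turn it off for itself by calling prctl(). On x86 the mitigation relies on updated microcode. (Once enabled, mitigation within the guest is identical to bare-metal. User processes can use the prctl() system call and “seccomp” filtering, or system wide mitigation can be turned on with a kernel parameter.)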

mds

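On a transition from kernel mode to user mode, or from the hypervisor to a guest, the store buffers, fill buffers, and load ports are flushed. The fix requires a microcode update. On processors that support SMT, it must also be combined with disabling SMT, to stop another logical CPU on the same physical core from refilling these buffers.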

l1tf

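The bits of the PFN field in a non-present PTE are stored inverted. A lookup using the inverted PFN then misses in the L1 cache, which prevents an attacker from constructing a PTE whose PFN has a chosen value and using it to leak the contents of the L1 cache.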
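
The bit inversion itself is a plain XOR with the PFN mask. The sketch below is a simplified model under stated assumptions: the PFN is taken to occupy bits 12 to 51 of a 64-bit entry, bit 0 is the present bit, and `protect_pte` and `pfn_of` are hypothetical helper names rather than kernel functions.

```python
PRESENT = 1 << 0
PFN_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)  # bits 12..51, illustrative

def protect_pte(pte):
    """Invert the PFN bits of a non-present PTE; leave present PTEs alone."""
    if pte & PRESENT:
        return pte
    return pte ^ PFN_MASK

def pfn_of(pte):
    """Extract the page frame number stored in the entry."""
    return (pte & PFN_MASK) >> 12
```

Because XOR is its own inverse, applying `protect_pte` a second time restores the original entry, and a small PFN turns into a very large one that should not correspond to cached physical memory.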

SLAB_FREELIST_RANDOM

Randomizes the SLAB freelist to make heap-overflow attacks less reliable. (Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize the SLAB freelist. The list is randomized during initialization of a new set of pages. The order on different freelist sizes is pre-computed at boot for performance. Each kmem_cache has its own randomized freelist. Before pre-computed lists are available freelists are generated dynamically. This security feature reduces the predictability of the kernel SLAB allocator against heap overflows rendering attacks much less stable.)

SHUFFLE_PAGE_ALLOCATOR

Randomizes the page allocator's free lists. It may negatively affect workloads on platforms without a memory-side cache.

slab_nomerge

Disables slab cache merging. Merged caches let a kernel heap overflow overwrite objects that belong to a different cache; keeping caches unmerged confines the damage to objects in the same cache. (For reduced kernel memory fragmentation, slab caches can be merged when they share the same size and other characteristics. This carries a small risk of kernel heap overflows being able to overwrite objects from merged caches, which reduces the difficulty of such heap attacks. By keeping caches unmerged, these kinds of exploits can usually only damage objects in the same cache. To disable merging at runtime, “slab_nomerge” can be passed on the kernel command line.)

unprivileged_userfaultfd

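Lets the administrator prevent unprivileged users from using the userfaultfd system call, which handles page faults in user space.

DEBUG_{LIST,SG,CREDENTIALS,NOTIFIERS,VIRTUAL}

DEBUG_LIST: sanity-checks linked-list operations

DEBUG_SG: checks scatter-gather tables

DEBUG_CREDENTIALS: checks that the number of pointers to each credential structure is consistent with its usage count; with SELinux it also checks the security pointer in the credential structure

DEBUG_NOTIFIERS: checks kernel notifier chains

DEBUG_VIRTUAL: adds sanity checks to the virtual-to-page address translation code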

BUG_ON_DATA_CORRUPTION

Triggers a BUG when corruption of kernel memory data structures is detected.

STATIC_USERMODEHELPER

Routes every call the kernel makes to a user-mode helper through the single program specified by this setting.

LOCKDOWN_LSM

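An optionally enabled LSM that draws a clear line between user-space and kernel code and prevents even the root user from tampering with kernel code.

ARM:STACKPROTECTOR_PER_TASK

Stack canary protection, with a separate canary value for each task.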

STACKPROTECTOR

Stack canary protection.
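
How a canary catches a linear stack overflow can be simulated with a byte array. This is a toy model: the frame layout, sizes, and the `call_with_input` helper are assumptions for illustration, whereas real canaries are inserted by the compiler.

```python
import secrets

BUF_SIZE = 8
CANARY_SIZE = 8

def call_with_input(data):
    """Simulate a function that copies `data` into an 8-byte stack buffer."""
    canary = secrets.token_bytes(CANARY_SIZE)
    # Frame layout: [buffer][canary][saved return address]
    frame = bytearray(BUF_SIZE) + bytearray(canary) + bytearray(b"RETADDR!")
    frame[:len(data)] = data             # unchecked copy, like a buggy memcpy
    stored = bytes(frame[BUF_SIZE:BUF_SIZE + CANARY_SIZE])
    if stored != canary:                 # checked just before "returning"
        raise RuntimeError("stack smashing detected")
    return "returned normally"
```

An overflow that reaches the saved return address has to run over the canary first, so the corruption is noticed before the function returns.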

FORTIFY_SOURCE

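Checks, at compile time and at run time, whether common string and memory functions such as strcpy and memcpy would overflow a buffer. It can only catch overflows where the size of the buffer is known to the compiler.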

slub_debug

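Used only to detect overflows in memory allocated from the SLUB allocator; enabling it also changes the heap layout. It accepts several option values.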

HARDENED_USERCOPY

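Checks data copied between user space and kernel space, including address validity, stack bounds, and whether the kernel text segment is involved.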
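
One of these checks, that a copy must stay within a single known object, can be sketched as a bounds test. The object table and the `usercopy_ok` helper are hypothetical; the kernel derives object bounds from its allocators rather than from a list.

```python
def usercopy_ok(objects, ptr, length):
    """Allow the copy only if [ptr, ptr + length) fits inside one object."""
    if length < 0:
        return False
    for start, size in objects:
        if start <= ptr and ptr + length <= start + size:
            return True
    return False

# (start address, size) of two adjacent heap objects: made-up values
HEAP_OBJECTS = [(0x1000, 64), (0x1040, 64)]
```

A copy of 65 bytes starting at the first object would spill into its neighbour, so it is rejected even though both addresses are valid heap memory.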

THREAD_INFO_IN_TASK

Moves the thread information off the stack and into the task_struct structure.

VMAP_STACK

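Uses memory obtained from vmalloc as the kernel stack. vmalloc's guard pages improve stack-overflow detection and memory fragmentation is reduced, although the memory may be physically non-contiguous.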

SCHED_STACK_END_CHECK

Checks on every call to schedule() whether the end of the stack has been overwritten, and panics if it has. (This option checks for a stack overrun on calls to schedule(). If the stack end location is found to be over written always panic as the content of the corrupted region can no longer be trusted. This is to ensure no erroneous behaviour occurs which could result in data corruption or a sporadic crash at a later stage once the region is examined. The runtime overhead introduced is minimal.)

PAX_MEMORY_STACKLEAK

Erases the kernel stack when a process returns from kernel mode to user mode, and also checks the process's kernel stack for overflow.

init_on_free/init_on_alloc

Initializes memory when it is allocated (init_on_alloc) or freed (init_on_free), to avoid leaking information.

PAGE_POISONING

Fills pages with a specific byte pattern when they are freed, and checks on allocation that the pattern has not been modified.
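
The poison-and-verify cycle looks roughly like this. It is a toy model on a byte array: the 0xAA pattern is an assumed example value, and `free_page` and `alloc_page` here are illustrative helpers, not the kernel functions of similar names.

```python
PAGE_SIZE = 4096
POISON = 0xAA  # example poison byte

def free_page(page):
    """Fill the page with the poison byte when it is freed."""
    page[:] = bytes([POISON]) * PAGE_SIZE

def alloc_page(page):
    """Verify the poison pattern before handing the page out again."""
    if any(b != POISON for b in page):
        raise RuntimeError("page was modified while free")
    return page
```

A write to the page after it was freed, for example through a dangling pointer, disturbs the pattern and is detected at the next allocation.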

PAX_MEMORY_SANITIZE

Erases memory once it has been freed.

X86:X86_INTEL_UMIP

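Intel User-Mode Instruction Prevention. When the current privilege level (CPL) is greater than 0, executing the SGDT, SIDT, SLDT, SMSW, or STR instruction raises a general-protection exception.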

ARM:HARDEN_EL2_VECTORS

Hardens the EL2 vector mapping so that information leaked from system registers does not reveal the EL2 layout.

kptr_restrict

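Replaces the function addresses shown in /proc/kallsyms with zeros; depending on the value, this applies to unprivileged users only or to everyone.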
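
The three settings can be summarized as a small decision function. This is a simplified reading of the sysctl for pointers printed in the restricted format: `pointer_is_zeroed` is a hypothetical helper, and level 0 is modelled only as "not zeroed" without saying how the value is otherwise presented.

```python
def pointer_is_zeroed(kptr_restrict, has_cap_syslog):
    """Return True if a restricted kernel pointer is shown as zeros."""
    if kptr_restrict == 0:
        return False                 # 0: no zeroing
    if kptr_restrict == 1:
        return not has_cap_syslog    # 1: hidden unless privileged
    return True                      # 2: hidden from everyone
```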

SECURITY_DMESG_RESTRICT

Prevents non-administrator users from reading the kernel log (the syslog, viewed with dmesg).

INIT_STACK_ALL

Fills all uninitialized stack memory, including uninitialized variables, with the pattern 0xAA.

STRUCTLEAK_BYREF_ALL

Forces initialization, inserted when the kernel is built, of every stack variable that is passed by reference (gives the kernel complete initialization coverage of all stack variables passed by reference)

SLAB_FREELIST_HARDENED

Hardens the freelist against common freelist attacks.
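
One hardening idea in this area is to store freelist pointers in obfuscated form, so that an attacker who overwrites one with a raw address gets an unpredictable result. The sketch below assumes a simple XOR scheme with a per-cache secret and the address of the slot holding the pointer; the helper name and the addresses are made up, and the kernel's exact formula may differ between versions.

```python
import secrets

MASK64 = (1 << 64) - 1

def encode_freelist_ptr(ptr, ptr_addr, cache_random):
    """Obfuscate a freelist pointer; the same call decodes it again."""
    return (ptr ^ cache_random ^ ptr_addr) & MASK64

cache_random = secrets.randbits(64)   # per-cache secret (assumed)
next_free = 0xFFFF_8880_1234_5600     # made-up object address
slot_addr = 0xFFFF_8880_1234_5500     # made-up location of the pointer
stored = encode_freelist_ptr(next_free, slot_addr, cache_random)
```

Decoding is the same XOR, so the allocator recovers `next_free` from `stored`, while a value written by someone who does not know `cache_random` decodes to an address they cannot choose.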

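History of Linux kernel security updates

Link

Afterword

Having translated all of these security mechanisms and read a lot of articles along the way, I came across many things I had not understood before, such as microcode and CWE. I learned quite a lot, so the two or three days were well spent.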

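References

Summary of Linux Security Mitigation Mechanisms (Linux 安全缓解机制总结)

LSM: Related Knowledge and Understanding (LSM相关知识及理解)

CPU Vulnerabilities Explained (CPU漏洞详解)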