I have never dug into low level things like cpu architectures etc. and decided to give it a try when I learned about cpu.land.
I already was aware of the existence of user and kernel mode but while I was reading site it came to me that "I still can harm my system with userland programs so what does it mean to switch user mode for almost everything other than kernel and drivers?" also we still can do many things with syscalls, what is that stopping us(assuming we want to harm system of course) from damaging our system.
[edit1]: grammar mistakes
The reason behind kernel mode/user mode separation is to require all user-land programs to have to go through the kernel to do any modification to the system. In other words, would it not be for syscalls, the only thing a user land program could do would be to burn CPU cycles. And even then, the kernel can still preempt it any time to let other, potentially more important programs, run instead.
So if a program can harm your system from userland, it’s because the kernel allowed it, every time. Which is why we currently see a slow move toward sandboxing everything. Basically, the idea of sandboxing is to give the kernel enough information about the running program so that we can tailor which syscalls it can do and with which arguments. For example: you want to prevent an application from accessing the network? Prevent it from allocating sockets through the associated syscall.
The reason for this slow move is historical really: introducing all those protections from the get go would require a lot of development time to start with, but it had to be built unpon non-existant security layers and not break all programs in the process. CPUs were not even powerful enough to waste cycles on such concerns.
Now, to better understand user mode/kernel mode, you have to realize that there are actually more modes than this. I can only speak for the ARM architecture because it’s the one I know, but x86 has similar mechanisms. Basically, from the CPU perspective, you have several privilege levels. On x86 those are called rings, on ARM, they’re called Exception Level. On ARM, a CPU has up to four of those, EL3 to EL0. They also have names based on their purpose (inherited from ARMv7). So EL3 is firmware level, EL2 is hypervisor, EL1 is system and EL0 is user. A kernel typically run on EL2 and EL1. EL3 is reserved for the firmware/boot process to do the most basic setup, partly required by the other ELs. EL2 is called hypervisor because it allows to have several virtual EL1 (and even EL2). In other words, a kernel running at EL2 can run several other kernels at EL1: this is virtualization and how VMs are implemented. Then you have your kernel/user land separation with most of the kernel (and driver) logic running at EL1 and the user programs running at EL0.
Each level allocates resources for the sub-level (under the form of memory map, as memory maps, which do not necessarily map to RAM, are also used to talk to devices). Would a level try to access a resource (memory address) it has no rights to, an exception would be raised to the upper level, which would then decide what to do: let it through or terminate the program (the later translates to a kernel panic/BSOD when the program in question is the kernel itself or a segmentation fault/bus error for user land programs).
This mechanism is fairly easy to understand with the swap mechanism: the kernel allows your program to access some page in memory when asked through brk or mmap, used by malloc. But then, when the system is under memory pressure, and it turns out your program has not used that memory region for a little while, the kernel swaps it out. Which means your program is now forbidden from accessing this memory. When the program tries to access that memory again, the kernel is informed of the action through a exception raised (unintentionally) by your program. The kernel then swaps back the memory region from disk, allows your program to access the memory region again, and then let the program resume to a state prior to the memory access (that it will then re-attempt without even realizing).
So basically, a level is fully responsible for what a sub-level does. In theory, you could have no protection at all: EL1 (the kernel) could allow EL0 to modify all the memory EL1 has access to (again, those are memory maps, that can also map to devices, not necessarily RAM). In practice, the goal of EL1 is to let nothing through without being involved itself: the program wants to write something on the disk: syscall, wants more memory: syscall, wants to draw something on the screen: syscall, use the network: syscall, talk to another program: syscall.
But the reason is not only security. It is also, and most importantly, abstraction. For example, when talking to a USB device, a user program does not have to know the USB protocol. This is implemented once in the kernel and then userland programs can use that to deal with all the annoying stuff such as timings, buffers, interruptions and so on. So the syscalls were initially designed for that: build a library of functions all user programs can re-use without having to re-implement them, or worse, without having to deal with the specifics of every device/vendor: this is the sole responsibility of the kernel.
So there you have it: a user program cannot harm the computer without going through the kernel first. But the kernel allows it nonetheless because it was not initially designed as a security feature. The security concerns came afterward and were initially implemented with users, which are mostly enough for servers, and where root has nearly as many privileges as the kernel itself (because the kernel allows it). Those are currently being improved under the form of sandboxes, for which the work started a while ago, with every OS (and CPU architecture) having its own implementation. But we are only seeing widespread adoption by userland since fairly recently on desktop. Partly thanks to the push from smartphones where application-level privileges (to access the camera for example) were born AFAIK.
Nowadays, CPUs are powerful enough to even have security features to try to protect a userland program from itself: from buffer overflow, return address manipulation and the like. If you’re interested, I recommend you look at the concept of pointer authentication.
Great write up, very informative. Thank you.
I’d say that the separation is not just about security. It’s also about performance and stability through separation of duties. The kernelb does a lot of work that is but directly related to current activity in userland. A good portion of this is to keep hardware in a state where userland programs can reliably run without having to reimplement low level functionality.
Additionally, when it comes to performance, it’s worth looking at monolith vs microkernels. It may seem counter-intuitive but, for general-purpose computing, monolithic kernels outperform microkernels by a wide margin. This is due to the near exponential increase in context switching that ends up being required by microkernels for the sorts of tasks needed for such use cases.
This was very informative, thanks !
I appreciate every single second you spent on this, I hope more people will benefit
Well hopefully you can’t harm your computer with userland programs. Windows is perhaps a bit messy at this, generally, but Unix-like systems have pretty good protections against non-superusers interfering with either the system itself, or other users on the system.
Having drivers run in the kernel and applications run in userland also means unintentional application errors generally won’t crash your entire system. Which is pretty important…
Windows 7 and later, have even better anti-non-superuser protections than Unix-like systems. It’s taken a while for Linux to add a capabilities permission system to limit superusers, something that’s been available on Windows all the time.
Er, selinux was released nearly a decade before Windows 7, and was integrated into mainline just a few years later, even before vista added UAC.
Big difference between “not available” and “often not enabled”.
Windows 95 already had an equivalent of selinux in the policy editor, “often not enabled”. UAC is the equivalent of sudo, previously “not available”.
Windows 7 also had runtime driver and executable signature testing (“not available” on Linux), virtual filesystem views for executables (“not available” on Linux), overall system auditing (“often not enabled” on Linux), an outbound per-executable firewall (“not available” on Linux), extended ACLs for the filesystem (“often not enabled” and in part “not available” on Linux)… and so on.
Now, Linux is great, it had a much more solid kernel model from the beginning, and being OpenSource allows having a purpose-built kernel for either security, flexibility, tinkerability, or whatever. But it’s still lacking several security features from Windows, which are useful in a generalistic system that allows end-users to run random software.
Android had to fix those shortcomings by pushing most software into a JVM, while Flatpak is getting popular on Linux. Modern Windows does most of that transparently… at a hit to performance… and doesn’t let you opt-out, which angers tinkerers… but those are the drawbacks of security.
dd if=/dev/zero of=/dev/sda
is a userland program, which I would say causes harm./dev/sda
access requires superuser/root permissions from the kernel, which means asking the kernel to lift many of the protections.On some unix systems (MacOS for example) you can’t even do that with root.
You’d need reboot into firmware, change some flags on the boot partition, and then reboot back into the regular operating system.
To install a new version of the operating system on a Mac, it creates a new snapshot of your boot hard drive, updates the system there, then reboots instructing the firmware to reboot on the new snapshot. The firmware does it’s a few checks of it’s own as well, and if it fails to boot then it will reboot on the old snapshot (which is only removed after successfully booting on to the new one). That’s not only a better/more reliable way to upgrade the operating system, it’s also the only way it can be done because even the kernel doesn’t have write access to those files.
The only drawback is you can’t use your computer while the firmware checks/boots the updated system. But Apple seems to be laying the foundations for a new process where your updated operating system will boot alongside the old version (with hypervisors) in the background, be fully tested/etc, and then it should be able to switch over to the other operating system pretty much instantly. It would likely even replace the windows of running software with a screenshot, then instruct the software to save it’s state and relaunch to restore functionality to the screenshot windows (they already do this if a Mac’s battery runs really low - closing everything cleanly before power cuts out, then restore everything once you charge the battery).
That’s interesting, I don’t have much contact with Apple’s ecosystem.
Sounds similar to a setup that Linux allows, with the root filesystem on btrfs, making a snapshot of it and updating, then live switching kernels. But there is no firmware support to make the switch, so it relies on root having full access to everything.
The hypervisors approach seem like what Windows is doing, where Windows itself gets booted in a Hyper-X VM, allowing WSL2 and every other VM to run at “native” speed (since “native” itself is a VM), and in theory should allow booting a parallel updated Windows, then just switching VMs.
On Linux there is also a feature for live migrating VMs, which allows software to keep running while they’re being migrated with just a minimum pause, so they could use something like that.
Yes, which is literally what OP is asking about. They mention system calls, and are asking, if a userland program can do dangerous thing using system calls, why is there a divide between user and kernel. “Because the kernel can then check permissions of the system call” is a great answer, but “hopefully you can’t harm your computer with userland programs” is completely wrong and misguided.
Yeah, security is in layers and userland isn’t automatically “safe”, if that’s what you’re pointing out. So I did mention non-superusers. Separating the kernel from userland applications is also critically important to (try to) prevent non-superusers from accessing APIs and devices which only superusers (or those in particular groups) are able to reach.
On x86, there are actually 4 ring levels (0 to 3), but only two (0 and 3) are used for everything. On modern hardware there are also virtualization and service, and remote management rings, sometimes referred as -1, -2 and -3.
what is that stopping us(assuming we want to harm system of course) from damaging our system
Some CPU instructions only work at a certain ring level or lower. For example, changing memory mappings, can only be done from ring 0 or below, so a userland program running in ring 3 that would try to access some other programs memory, will get an “forbidden instruction” exception, that would escalate to the kernel’s handler, and it could decide to kill the “malicious program”. There are also many interrupts a ring 0 program/kernel can set, to intercept different program behaviors and handle them as it sees (allow, modify, redirect, block, log, etc.).
In order to “harm your system”, as in wreak havoc with other programs, you need to either use a kernel function in some way, or get your code to execute at ring 0 (privilege escalation).
If you mean “harm your system” as in actual hardware, some drivers might allow you to overclock something, turn fans off, and end up with your GPU melting… but that would be a protection failure from the driver/hardware (hardware itself can have anti-overheat protections).
Depends what your definition of “damaging the system” is.
I think one of the motivations for having separate modes like this, with (some) separate registers for each, is to reduce the time taken to switch contexts between modes. If they didn’t have separate registers, the data in the user mode registers would have to be saved somewhere when making a switch into kernel mode, and then copied back again when switching back to user mode.
There are no separate registers, every call to kernel mode takes extra time precisely because it has to save all the caller’s registers, then restore them again before returning.
It involves even more registers than what’s visible to the user, because the kernel also has to change the ones related to memory and device access permissions.
If I had to guess (and simplify too) I would say that that would be the difference between your freshly compiled C program just dying with sigsegv then getting (mostly) peacefully cleaned up, or otherwise your system either flat out dying on you or even worse, something somewhere getting corrupted and the system spirals out of control possibly thrashing data and even hardware.
The idea behind user mode and kernel mode is that it gives the operating system a framework to establish security permissions etc. some operating systems might take this more seriously than others, but the point is that the modes are a feature of the cpu, provided by the manufacturer.
Also, when you’re talking about “harming” the system, you should consider what’s possible in user land vs kernel mode. Kernel mode is where drivers manipulate hardware - these days, there is an additional layer of safety/abstraction done in the firmware level, so software can’t create physical damage to the hardware (like the classic “hackers can turn your computer into a bomb” advertisement).
However, the kernel can:
- trash a filesystem by writing data directly to the drive
- trash system memory (RAM)
- trash cpu registers
In kernel mode, it’s very easy to cause the OS to crash via these methods. A user mode program will have much higher level access to the system and won’t be able to cause damage so easily. Programs often crash themselves - maybe you’ve seen null pointer exceptions, or out of bounds memory exceptions - these are caused by a userland program doing something it shouldn’t (even unintentionally), and the OS intervening to stop that. However, a userland program shouldn’t be able to crash the whole OS (e.g. cause a BSOD on windows, or a kernel panic on Linux). Usually when you see that, it’s caused by a driver. Drivers run in kernel mode.
As for being able to do bad things with syscalls, you’re exactly right, and that’s why we have permissions around syscalls :)
On Linux there’s systemd.exec, seccomp, the capability framework, and of course selinux. On openbsd they have pledge (which is slightly different, but their threat model is also slightly different to begin with). I’m not sure what windows offers in this regard, from a quick search it seems there isn’t an exact equivalent of the Linux systems, but there are still security frameworks.
There are many frameworks and permissions systems that form an operating system, and each one might cover a different area. OS security is a pretty broad topic but very interesting, I encourage you to keep learning and asking questions!
Also, I just woke up and haven’t had coffee, so please bear with my rambling post.
classic “hackers can turn your computer into a bomb” advertisement
Somewhat ironically, with hardware allowing drivers to overclock their speed, voltage, cooling, and thus temperature and heat output… which drivers allow userland software with cool visuals to tweak at will… and laptops with high energy density lithium batteries… that would be more plausible today than at the time of those advertisements.
(except for some CPUs that used to burn a hole in the motherboard if cooling stopped… but those didn’t explode; some PSUs exploded, but back then were not controllable by software)
When I last used a computer that had a single mode (about 20 years ago), I was in the habit of saving my work about every 15 seconds and manually backing up my documents (to an offline backup that wasn’t physically connected to the computer) multiple times per day.
That’s how often the computer crashed. I never had a virus in those days, it was always innocent and unintentional software bugs which would cause your computer to need a reboot regularly and occasionally delete all of your files.
Trust me, things are better now. I still save regularly and maintain backups, but I do it a lot less religiously than I used to, because I’ve lost my work just once in the last several years. It used to be far more often.