On most UNIX® systems, root has omnipotent power. This promotes insecurity: an attacker who gains root on a system has every function at their fingertips. FreeBSD provides sysctls which dilute the power of root in order to minimize the damage caused by an attacker. One of these mechanisms is called secure levels. Another, present from FreeBSD 4.0 onward, is a utility called jail(8). Jail chroots an environment and sets certain restrictions on processes which are forked within the jail. For example, a jailed process cannot affect processes outside the jail, utilize certain system calls, or inflict any damage on the host environment.
Jail is becoming the new security model. People are running potentially vulnerable servers such as Apache, BIND, and sendmail within jails, so that if an attacker gains root within the jail, it is only an annoyance, and not a devastation. This article mainly focuses on the internals (source code) of jail. For information on how to set up a jail see the handbook entry on jails.
Jail consists of two realms: the userland program, jail(8), and the code implemented within the kernel: the jail(2) system call and associated restrictions. I will be discussing the userland program and then how jail is implemented within the kernel.
The source for the userland jail is located in /usr/src/usr.sbin/jail, consisting of one file, jail.c. The program takes these arguments: the path of the jail, hostname, IP address, and the command to be executed.
In jail.c, the first thing I would note is the declaration of an important structure, struct jail j, whose type is included from /usr/include/sys/jail.h.
The definition of the jail structure is:
/usr/include/sys/jail.h:
struct jail {
    u_int32_t   version;
    char        *path;
    char        *hostname;
    u_int32_t   ip_number;
};
As you can see, there is an entry for each of the arguments passed to the jail(8) program, and indeed, they are set during its execution.
/usr/src/usr.sbin/jail/jail.c:
char path[PATH_MAX];
...
if (realpath(argv[0], path) == NULL)
    err(1, "realpath: %s", argv[0]);
if (chdir(path) != 0)
    err(1, "chdir: %s", path);
memset(&j, 0, sizeof(j));
j.version = 0;
j.path = path;
j.hostname = argv[1];
One of the arguments passed to the jail(8) program is an IP address with which the jail can be accessed over the network. jail(8) translates the IP address given into host byte order and then stores it in j (the jail structure).
/usr/src/usr.sbin/jail/jail.c:
struct in_addr in;
...
if (inet_aton(argv[2], &in) == 0)
    errx(1, "Could not make sense of ip-number: %s", argv[2]);
j.ip_number = ntohl(in.s_addr);
The inet_aton(3) function "interprets the specified character string as an Internet address, placing the address into the structure provided." The ip_number member in the jail structure is set only after the address that inet_aton(3) placed in the in structure has been translated into host byte order by ntohl(3).
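The parse-then-convert step can be tried in isolation. The following is a minimal userland sketch, not code from jail.c: the helper name parse_ip_host_order is hypothetical, and only inet_aton(3) and ntohl(3) are taken from the original source.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of jail.c): parse a dotted-quad string
 * and store the address in host byte order, as jail(8) does for
 * j.ip_number. Returns 0 on success, -1 if the string is not an address.
 */
int
parse_ip_host_order(const char *text, uint32_t *out)
{
    struct in_addr in;

    /* inet_aton() returns 0 when the string cannot be parsed. */
    if (inet_aton(text, &in) == 0)
        return (-1);

    /* in.s_addr is in network byte order; convert before storing. */
    *out = ntohl(in.s_addr);
    return (0);
}
```

With this helper, the string "192.0.2.10" yields the host-order value 0xC000020A regardless of the machine's endianness, which is the form the kernel later compares against.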
Finally, the userland program jails the process. Jail now becomes an imprisoned process itself and then executes the command given using execv(3).
/usr/src/usr.sbin/jail/jail.c:
i = jail(&j);
...
if (execv(argv[3], argv + 3) != 0)
    err(1, "execv: %s", argv[3]);
As you can see, the jail() function is called, and its argument is the jail structure which has been filled with the arguments given to the program. Finally, the program you specify is executed. I will now discuss how jail is implemented within the kernel.
We will now be looking at the file /usr/src/sys/kern/kern_jail.c. This is the file where the jail(2) system call, appropriate sysctls, and networking functions are defined.
In kern_jail.c, the following sysctls are defined:
/usr/src/sys/kern/kern_jail.c:
int jail_set_hostname_allowed = 1;
SYSCTL_INT(_security_jail, OID_AUTO, set_hostname_allowed, CTLFLAG_RW,
    &jail_set_hostname_allowed, 0,
    "Processes in jail can set their hostnames");
int jail_socket_unixiproute_only = 1;
SYSCTL_INT(_security_jail, OID_AUTO, socket_unixiproute_only, CTLFLAG_RW,
    &jail_socket_unixiproute_only, 0,
    "Processes in jail are limited to creating UNIX/IPv4/route sockets only");
int jail_sysvipc_allowed = 0;
SYSCTL_INT(_security_jail, OID_AUTO, sysvipc_allowed, CTLFLAG_RW,
    &jail_sysvipc_allowed, 0,
    "Processes in jail can use System V IPC primitives");
static int jail_enforce_statfs = 2;
SYSCTL_INT(_security_jail, OID_AUTO, enforce_statfs, CTLFLAG_RW,
    &jail_enforce_statfs, 0,
    "Processes in jail cannot see all mounted file systems");
int jail_allow_raw_sockets = 0;
SYSCTL_INT(_security_jail, OID_AUTO, allow_raw_sockets, CTLFLAG_RW,
    &jail_allow_raw_sockets, 0,
    "Prison root can create raw sockets");
int jail_chflags_allowed = 0;
SYSCTL_INT(_security_jail, OID_AUTO, chflags_allowed, CTLFLAG_RW,
    &jail_chflags_allowed, 0,
    "Processes in jail can alter system file flags");
int jail_mount_allowed = 0;
SYSCTL_INT(_security_jail, OID_AUTO, mount_allowed, CTLFLAG_RW,
    &jail_mount_allowed, 0,
    "Processes in jail can mount/unmount jail-friendly file systems");
Each of these sysctls can be accessed by the user through the sysctl(8) program. Throughout the kernel, these specific sysctls are recognized by their name. For example, the name of the first sysctl is security.jail.set_hostname_allowed.
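Besides sysctl(8), a program can read such a value by name through sysctlbyname(3). The sketch below is an illustration under assumptions: the wrapper name read_int_sysctl is hypothetical, and on systems other than FreeBSD it simply fails with ENOSYS so that the example still compiles there.

```c
#include <errno.h>
#include <stddef.h>
#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/*
 * Hypothetical wrapper: read an integer sysctl by its dotted name.
 * Returns 0 and fills *value on success, -1 on failure (errno is set).
 */
int
read_int_sysctl(const char *name, int *value)
{
#ifdef __FreeBSD__
    size_t len = sizeof(*value);

    /* No new value is supplied, so this call is a read-only query. */
    if (sysctlbyname(name, value, &len, NULL, 0) != 0)
        return (-1);
    return (0);
#else
    /* Fallback for other systems: the sysctl tree is not available. */
    (void)name;
    (void)value;
    errno = ENOSYS;
    return (-1);
#endif
}
```

A caller would pass a name such as "security.jail.set_hostname_allowed" and check the return value before trusting the integer.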
Like all system calls, the jail(2) system call takes two arguments, struct thread *td and struct jail_args *uap. td is a pointer to the thread structure which describes the calling thread. In this context, uap is a pointer to the structure in which a pointer to the jail structure passed by the userland jail.c is contained. When I described the userland program before, you saw that the jail(2) system call was given a jail structure as its own argument.
/usr/src/sys/kern/kern_jail.c:
/*
 * struct jail_args {
 *  struct jail *jail;
 * };
 */
int
jail(struct thread *td, struct jail_args *uap)
Therefore, uap->jail can be used to access the jail structure which was passed to the system call. Next, the system call copies the jail structure into kernel space using the copyin(9) function. copyin(9) takes three arguments: the address of the data which is to be copied into kernel space, uap->jail; where to store it, j; and the size of the storage. The jail structure pointed to by uap->jail is copied into kernel space and is stored in another jail structure, j.
/usr/src/sys/kern/kern_jail.c:
error = copyin(uap->jail, &j, sizeof(j));
There is another important structure defined in jail.h. It is the prison structure. The prison structure is used exclusively within kernel space. Here is the definition of the prison structure.
/usr/include/sys/jail.h:
struct prison {
    LIST_ENTRY(prison) pr_list;              /* (a) all prisons */
    int                pr_id;                /* (c) prison id */
    int                pr_ref;               /* (p) refcount */
    char               pr_path[MAXPATHLEN];  /* (c) chroot path */
    struct vnode       *pr_root;             /* (c) vnode to rdir */
    char               pr_host[MAXHOSTNAMELEN]; /* (p) jail hostname */
    u_int32_t          pr_ip;                /* (c) ip addr host */
    void               *pr_linux;            /* (p) linux abi */
    int                pr_securelevel;       /* (p) securelevel */
    struct task        pr_task;              /* (d) destroy task */
    struct mtx         pr_mtx;
    void               **pr_slots;           /* (p) additional data */
};
The jail(2) system call then allocates memory for a prison structure and copies data between the jail and prison structure.
/usr/src/sys/kern/kern_jail.c:
MALLOC(pr, struct prison *, sizeof(*pr), M_PRISON, M_WAITOK | M_ZERO);
...
error = copyinstr(j.path, &pr->pr_path, sizeof(pr->pr_path), 0);
if (error)
    goto e_killmtx;
...
error = copyinstr(j.hostname, &pr->pr_host, sizeof(pr->pr_host), 0);
if (error)
    goto e_dropvnref;
pr->pr_ip = j.ip_number;
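The copy from the jail structure into the prison structure can be modeled in userland to show why the string copies are bounded. This is a sketch under assumptions, not kernel code: the sketch_ types, the small buffer sizes, and sketch_copystr (a stand-in for copyinstr(9) that fails with ENAMETOOLONG when the string does not fit) are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PATH_MAX 32  /* stand-in for MAXPATHLEN */
#define SKETCH_HOST_MAX 16  /* stand-in for MAXHOSTNAMELEN */

/* Hypothetical userland mirrors of struct jail and struct prison. */
struct sketch_jail {
    uint32_t    version;
    const char  *path;
    const char  *hostname;
    uint32_t    ip_number;
};

struct sketch_prison {
    char        pr_path[SKETCH_PATH_MAX];
    char        pr_host[SKETCH_HOST_MAX];
    uint32_t    pr_ip;
};

/* Stand-in for copyinstr(9): bounded copy of a NUL-terminated string. */
static int
sketch_copystr(const char *src, char *dst, size_t len)
{
    size_t need;

    if (src == NULL)
        return (EFAULT);
    need = strlen(src) + 1;         /* include the terminating NUL */
    if (need > len)
        return (ENAMETOOLONG);
    memcpy(dst, src, need);
    return (0);
}

/* Fill a zeroed prison from a jail, stopping at the first error. */
int
sketch_prison_fill(const struct sketch_jail *j, struct sketch_prison *pr)
{
    int error;

    memset(pr, 0, sizeof(*pr));     /* mirrors M_ZERO in the allocation */
    error = sketch_copystr(j->path, pr->pr_path, sizeof(pr->pr_path));
    if (error)
        return (error);
    error = sketch_copystr(j->hostname, pr->pr_host, sizeof(pr->pr_host));
    if (error)
        return (error);
    pr->pr_ip = j->ip_number;       /* plain integer, copied by value */
    return (0);
}
```

The point of the bounded copy is that the jail structure only holds pointers supplied by the caller, so their targets must be validated and size-checked before they land in the fixed-size prison fields.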
Next, we will discuss another important system call, jail_attach(2), which implements the function to put a process into the jail.
/usr/src/sys/kern/kern_jail.c:
/*
 * struct jail_attach_args {
 *  int jid;
 * };
 */
int
jail_attach(struct thread *td, struct jail_attach_args *uap)
This system call makes the changes that distinguish a jailed process from an unjailed one. To understand what jail_attach(2) does for us, certain background information is needed.
On FreeBSD, each kernel visible thread is identified by its thread structure, while the processes are described by their proc structures. You can find the definitions of the thread and proc structure in /usr/include/sys/proc.h. For example, the td argument in any system call is actually a pointer to the calling thread's thread structure, as stated before. The td_proc member in the thread structure pointed to by td is a pointer to the proc structure which represents the process that contains the thread represented by td. The proc structure contains members which can describe the owner's identity (p_ucred), the process resource limits (p_limit), and so on. In the ucred structure pointed to by the p_ucred member in the proc structure, there is a pointer to the prison structure (cr_prison).
/usr/include/sys/proc.h:
struct thread {
    ...
    struct proc *td_proc;
    ...
};
struct proc {
    ...
    struct ucred *p_ucred;
    ...
};
/usr/include/sys/ucred.h:
struct ucred {
    ...
    struct prison *cr_prison;
    ...
};
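The pointer chain from a thread to its prison can be walked in a small userland model. The toy_ types below are hypothetical reductions of the kernel structures to the single members named above, and toy_jailed assumes that the jailed test amounts to checking cr_prison for NULL.

```c
#include <stddef.h>

/* Hypothetical, reduced mirrors of the kernel structures. */
struct toy_prison { int pr_id; };
struct toy_ucred  { struct toy_prison *cr_prison; };
struct toy_proc   { struct toy_ucred *p_ucred; };
struct toy_thread { struct toy_proc *td_proc; };

/* Assumed behavior: a credential is jailed if it points at a prison. */
int
toy_jailed(const struct toy_ucred *cred)
{
    return (cred->cr_prison != NULL);
}

/* Follow thread -> proc -> ucred -> prison; NULL means not jailed. */
const struct toy_prison *
toy_thread_prison(const struct toy_thread *td)
{
    return (td->td_proc->p_ucred->cr_prison);
}
```

Because the prison hangs off the credential rather than the process itself, anything that shares or copies the credential carries the jail membership along with it.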
In kern_jail.c, the function jail() then calls function jail_attach() with a given jid. jail_attach() in turn calls function change_root() to change the root directory of the calling process. jail_attach() then creates a new ucred structure, and attaches the newly created ucred structure to the calling process after it has successfully attached the prison structure to the ucred structure. From then on, the calling process is recognized as jailed. When the kernel routine jailed() is called in the kernel with the newly created ucred structure as its argument, it returns 1 to tell that the credential is connected with a jail. The common ancestor of all the processes forked within the jail is the process which runs jail(8), as it calls the jail(2) system call. When a program is executed through execve(2), it inherits the jailed property of its parent's ucred structure, therefore it has a jailed ucred structure.
/usr/src/sys/kern/kern_jail.c:
int
jail(struct thread *td, struct jail_args *uap)
{
    ...
    struct jail_attach_args jaa;
    ...
    error = jail_attach(td, &jaa);
    if (error)
        goto e_dropprref;
    ...
}
int
jail_attach(struct thread *td, struct jail_attach_args *uap)
{
    struct proc *p;
    struct ucred *newcred, *oldcred;
    struct prison *pr;
    ...
    p = td->td_proc;
    ...
    pr = prison_find(uap->jid);
    ...
    change_root(pr->pr_root, td);
    ...
    newcred->cr_prison = pr;
    p->p_ucred = newcred;
    ...
}
When a process is forked from its parent process, the fork(2) system call uses crhold() to maintain the credential for the newly forked process. It inherently keeps the newly forked child's credential consistent with its parent, so the child process is also jailed.
/usr/src/sys/kern/kern_fork.c:
p2->p_ucred = crhold(td->td_ucred);
...
td2->td_ucred = crhold(p2->p_ucred);
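Why holding a reference is enough to keep the child jailed can be seen in a small model. The ref_ names are hypothetical, the credential is reduced to a reference count and a jailed marker, and ref_crhold assumes that crhold() takes one more reference and returns the same pointer.

```c
#include <stddef.h>

/* Hypothetical reduced credential: a refcount plus a jailed marker. */
struct ref_cred {
    unsigned int cr_ref;
    int          cr_jailed;
};

struct ref_proc {
    struct ref_cred *p_ucred;
};

/* Assumed crhold() behavior: bump the count, return the same object. */
struct ref_cred *
ref_crhold(struct ref_cred *cr)
{
    cr->cr_ref++;
    return (cr);
}

/* Model of the fork step: the child shares the parent's credential. */
void
ref_fork(const struct ref_proc *parent, struct ref_proc *child)
{
    child->p_ucred = ref_crhold(parent->p_ucred);
}
```

Since parent and child point at the same credential object, no jail-specific code is needed in the fork path for the child to remain inside the jail.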
Throughout the kernel there are access restrictions relating to jailed processes. Usually, these restrictions only check whether the process is jailed, and if so, return an error. For example:
if (jailed(td->td_ucred))
    return (EPERM);
System V IPC is based on messages. Processes can send each other these messages which tell them how to act. The functions which deal with messages are: msgctl(2), msgget(2), msgsnd(2) and msgrcv(2). Earlier, I mentioned that there were certain sysctls you could turn on or off in order to affect the behavior of jail. One of these sysctls was security.jail.sysvipc_allowed. By default, this sysctl is set to 0. If it were set to 1, it would defeat the whole purpose of having a jail; privileged users from the jail would be able to affect processes outside the jailed environment. The difference between a message and a signal is that a signal consists only of the signal number, whereas a message carries a type and a payload.
/usr/src/sys/kern/sysv_msg.c:
msgget(key, msgflg): msgget returns (and possibly creates) a message descriptor that designates a message queue for use in other functions.
msgctl(msgid, cmd, buf): Using this function, a process can query the status of a message descriptor.
msgsnd(msgid, msgp, msgsz, msgflg): msgsnd sends a message to a process.
msgrcv(msgid, msgp, msgsz, msgtyp, msgflg): a process receives messages using this function.
In each of the system calls corresponding to these functions, there is this conditional:
/usr/src/sys/kern/sysv_msg.c:
if (!jail_sysvipc_allowed && jailed(td->td_ucred))
    return (ENOSYS);
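The conditional combines a global switch with a per-credential test, and only one of the four combinations is refused. A minimal sketch of that decision follows; the function name ipc_gate is hypothetical, and its two parameters stand in for the jail_sysvipc_allowed variable and the result of jailed().

```c
#include <errno.h>

/*
 * Hypothetical model of the System V IPC gate.
 * sysvipc_allowed: value of security.jail.sysvipc_allowed (0 or 1).
 * is_jailed:       nonzero if the caller's credential is jailed.
 * Returns 0 if the call may proceed, ENOSYS if it is refused.
 */
int
ipc_gate(int sysvipc_allowed, int is_jailed)
{
    /* Only a jailed caller with the sysctl off is turned away. */
    if (!sysvipc_allowed && is_jailed)
        return (ENOSYS);
    return (0);
}
```

Returning ENOSYS rather than EPERM makes the facility look unimplemented from inside the jail, instead of present but forbidden.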
Semaphore system calls allow processes to synchronize execution by doing a set of operations atomically on a set of semaphores. Basically, semaphores provide another way for processes to lock resources. However, a process waiting on a semaphore that is in use will sleep until the resource is relinquished. The following semaphore system calls are blocked inside a jail: semget(2), semctl(2) and semop(2).
/usr/src/sys/kern/sysv_sem.c:
semctl(semid, semnum, cmd, ...): semctl does the specified cmd on the semaphore queue indicated by semid.
semget(key, nsems, flag): semget creates an array of semaphores, corresponding to key. key and flag take on the same meaning as they do in msgget.
semop(semid, array, nops): semop performs a group of operations indicated by array, to the set of semaphores identified by semid.
System V IPC allows for processes to share memory. Processes can communicate directly with each other by sharing parts of their virtual address space and then reading and writing data stored in the shared memory. These system calls are blocked within a jailed environment: shmdt(2), shmat(2), shmctl(2) and shmget(2).
/usr/src/sys/kern/sysv_shm.c:
shmctl(shmid, cmd, buf): shmctl does various control operations on the shared memory region identified by shmid.
shmget(key, size, flag): shmget accesses or creates a shared memory region of size bytes.
shmat(shmid, addr, flag): shmat attaches a shared memory region identified by shmid to the address space of a process.
shmdt(addr): shmdt detaches the shared memory region previously attached at addr.
Jail treats the socket(2) system call and related lower-level socket functions in a special manner. In order to determine whether a certain socket is allowed to be created, it first checks to see if the sysctl security.jail.socket_unixiproute_only is set. If set, sockets are only allowed to be created if the family specified is either PF_LOCAL, PF_INET or PF_ROUTE. Otherwise, it returns an error.
/usr/src/sys/kern/uipc_socket.c:
int
socreate(int dom, struct socket **aso, int type, int proto,
    struct ucred *cred, struct thread *td)
{
    struct protosw *prp;
    ...
    if (jailed(cred) && jail_socket_unixiproute_only &&
        prp->pr_domain->dom_family != PF_LOCAL &&
        prp->pr_domain->dom_family != PF_INET &&
        prp->pr_domain->dom_family != PF_ROUTE) {
        return (EPROTONOSUPPORT);
    }
    ...
}
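The family check can be isolated into a pure function to make the allowed set explicit. This is a userland sketch: the name jail_family_check and its flag parameters are hypothetical, while the PF_ constants and EPROTONOSUPPORT are the ones used in the excerpt above.

```c
#include <errno.h>
#include <sys/socket.h>

/*
 * Hypothetical model of the socreate() restriction.
 * is_jailed:        nonzero if the credential is jailed.
 * unixiproute_only: value of security.jail.socket_unixiproute_only.
 * family:           protocol family requested for the new socket.
 * Returns 0 if creation may continue, EPROTONOSUPPORT otherwise.
 */
int
jail_family_check(int is_jailed, int unixiproute_only, int family)
{
    if (is_jailed && unixiproute_only &&
        family != PF_LOCAL &&
        family != PF_INET &&
        family != PF_ROUTE)
        return (EPROTONOSUPPORT);
    return (0);
}
```

For example, a jailed caller asking for PF_INET6 is refused while the sysctl is set, whereas the same request from an unjailed process passes this particular check.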
The Berkeley Packet Filter provides a raw interface to data link layers in a protocol independent fashion. Whether BPF can be used in a jailed environment is now controlled by devfs(8).
There are certain protocols which are very common, such as TCP, UDP, IP and ICMP. IP and ICMP are on the same level: the network layer. Certain precautions are taken to prevent a jailed process from binding a protocol to an address that does not belong to its jail; these checks are made only if the nam parameter is set. nam is a pointer to a sockaddr structure, which describes the address on which to bind the service. A more exact definition is that sockaddr "may be used as a template for referring to the identifying tag and length of each address". In the function in_pcbbind_setup(), sin is a pointer to a sockaddr_in structure, which contains the port, address, length and domain family of the socket which is to be bound. Basically, this prevents any process in a jail from specifying an address that does not belong to the jail in which the calling process exists.
/usr/src/sys/netinet/in_pcb.c:
int
in_pcbbind_setup(struct inpcb *inp, struct sockaddr *nam, in_addr_t *laddrp,
    u_short *lportp, struct ucred *cred)
{
    ...
    struct sockaddr_in *sin;
    ...
    if (nam) {
        sin = (struct sockaddr_in *)nam;
        ...
        if (sin->sin_addr.s_addr != INADDR_ANY)
            if (prison_ip(cred, 0, &sin->sin_addr.s_addr))
                return (EINVAL);
        ...
        if (lport) {
            ...
            if (prison && prison_ip(cred, 0, &sin->sin_addr.s_addr))
                return (EADDRNOTAVAIL);
            ...
        }
    }
    if (lport == 0) {
        ...
        if (laddr.s_addr != INADDR_ANY)
            if (prison_ip(cred, 0, &laddr.s_addr))
                return (EINVAL);
        ...
    }
    ...
    if (prison_ip(cred, 0, &laddr.s_addr))
        return (EINVAL);
    ...
}
You might be wondering what function prison_ip() does. prison_ip() is given three arguments: a pointer to the credential (represented by cred), any flags, and an IP address. It returns 1 if the IP address does NOT belong to the jail or 0 otherwise. As you can see from the code, if it is indeed an IP address not belonging to the jail, the protocol is not allowed to bind to that address.
/usr/src/sys/kern/kern_jail.c:
int
prison_ip(struct ucred *cred, int flag, u_int32_t *ip)
{
    u_int32_t tmp;

    if (!jailed(cred))
        return (0);
    if (flag)
        tmp = *ip;
    else
        tmp = ntohl(*ip);
    if (tmp == INADDR_ANY) {
        if (flag)
            *ip = cred->cr_prison->pr_ip;
        else
            *ip = htonl(cred->cr_prison->pr_ip);
        return (0);
    }
    if (tmp == INADDR_LOOPBACK) {
        if (flag)
            *ip = cred->cr_prison->pr_ip;
        else
            *ip = htonl(cred->cr_prison->pr_ip);
        return (0);
    }
    if (cred->cr_prison->pr_ip != tmp)
        return (1);
    return (0);
}
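The same logic can be exercised in userland to see the address rewriting in action. The sketch below is a model, not the kernel function: model_prison_ip is a hypothetical name, and the credential is replaced by two plain parameters, a jailed indicator and the jail address in host byte order.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/*
 * Hypothetical userland model of prison_ip().
 * is_jailed: nonzero if the caller is jailed.
 * jail_ip:   the jail's address, host byte order (like pr_ip).
 * flag:      nonzero if *ip is already in host byte order.
 * Returns 1 if *ip does not belong to the jail, 0 otherwise. Wildcard
 * and loopback addresses are rewritten in place to the jail address.
 */
int
model_prison_ip(int is_jailed, uint32_t jail_ip, int flag, uint32_t *ip)
{
    uint32_t tmp;

    if (!is_jailed)
        return (0);

    /* Normalize to host byte order for the comparisons. */
    tmp = flag ? *ip : ntohl(*ip);

    if (tmp == INADDR_ANY || tmp == INADDR_LOOPBACK) {
        /* Write back in the same byte order the caller used. */
        *ip = flag ? jail_ip : htonl(jail_ip);
        return (0);
    }
    return (tmp != jail_ip);
}
```

A jailed bind to the wildcard or loopback address therefore succeeds but ends up bound to the jail's own address, while any other foreign address is rejected by the caller's error path.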
Even root users within the jail are not allowed to unset or modify any file flags, such as immutable, append-only, and undeletable flags, if the securelevel is greater than 0.
/usr/src/sys/ufs/ufs/ufs_vnops.c:
static int
ufs_setattr(ap)
    ...
{
    ...
    if (!priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0)) {
        if (ip->i_flags
            & (SF_NOUNLINK | SF_IMMUTABLE | SF_APPEND)) {
            error = securelevel_gt(cred, 0);
            if (error)
                return (error);
        }
        ...
    }
}
/usr/src/sys/kern/kern_priv.c:
int
priv_check_cred(struct ucred *cred, int priv, int flags)
{
    ...
    error = prison_priv_check(cred, priv);
    if (error)
        return (error);
    ...
}
/usr/src/sys/kern/kern_jail.c:
int
prison_priv_check(struct ucred *cred, int priv)
{
    ...
    switch (priv) {
    ...
    case PRIV_VFS_SYSFLAGS:
        if (jail_chflags_allowed)
            return (0);
        else
            return (EPERM);
    ...
    }
    ...
}
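The privilege decision for file flags can be reduced to a small model. Everything named toy_ below is hypothetical, and two behaviors are assumptions rather than facts taken from the excerpt: that an unjailed credential passes the prison check unconditionally, and that privileges not listed in the switch are denied.

```c
#include <errno.h>

/* Hypothetical privilege identifiers for the model. */
enum toy_priv {
    TOY_PRIV_VFS_SYSFLAGS,
    TOY_PRIV_UNLISTED
};

/*
 * Hypothetical model of prison_priv_check().
 * is_jailed:       nonzero if the credential is jailed.
 * chflags_allowed: value of security.jail.chflags_allowed.
 * Returns 0 if the jail layer does not object, EPERM otherwise.
 */
int
toy_prison_priv_check(int is_jailed, int chflags_allowed, enum toy_priv priv)
{
    /* Assumption: the jail layer has no say over unjailed callers. */
    if (!is_jailed)
        return (0);

    switch (priv) {
    case TOY_PRIV_VFS_SYSFLAGS:
        return (chflags_allowed ? 0 : EPERM);
    default:
        /* Assumption: anything not explicitly allowed is denied. */
        return (EPERM);
    }
}
```

With the default chflags_allowed value of 0, a jailed root fails this check for system file flags before the securelevel is even consulted.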
Last modified on : February 18, 2025 by Fernando Apesteguía