redis源码阅读(一)

SDS简介

sds（Simple dynamic string）是用C语言编写的一种字符串类型，可以动态的扩容，为redis中的基本数据结构之一。这个项目现在独立出来，github地址.

sds有两个版本，在redis 3.2版本以前使用的第一个版本数据结构如下：

typedef char *sds;		// sds被定义为char*类型，具体好处见下文

struct sdshdr {
    unsigned int len;   //等于sds所保存字符串的长度
    unsigned int free;  //bug数组中未使用字节的数量
    char buf[];         //字节数组，用于保存字符串
};

图示：

step_2

redis 3.2版本中，sds的数据结构发生了更改，针对不同的长度范围定义了不同的结构，如下：

/*
len: sds字符串的实际长度
alloc: 分配给字符串的总容量，不包含header和'\0'的容量
flags: 类型的标志，用一个字节的低三位来保存类型，主要为SDS_TYPE_5(0),  SDS_TYPE_8(1), SDS_TYPE_16(2), SDS_TYPE_32(3), SDS_TYPE_64(4)四种类型，高5位在SDS_TYPE_5类型中被用来存储字符串的长度
*/

struct __attribute__ ((__packed__)) sdshdr5 {		//对应字符串长度小于 1<<5 
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {		//对应字符串长度小于 1<<8
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {		//对应字符串长度小于 1<<16
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {		//对应字符串长度小于 1<<32
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {		//对应字符串长度小于 1<<64
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

在新的版本中，sds被分为了5个类型。这里面的定义为5种不同的sds header，字符串的内容被保存在bug数组中。对比起旧版本，新版本的主要优势是，redis会根据字符串的长度使用不同长度的header头，从而达到优化内存的作用。__attribute__ ((__packed__)) 作用是告诉编译器取消字节对齐，为什么要这么做呢，我还不太理解，个人猜测是因为sds类型创建时返回的地址为字符串的开始地址，通过取消字节对齐，可以通过数组下标的方式方便去访问结构中其他的结构体成员，e.g：初始化sds返回的地址为字符串的首地址，如：sds s，通过s[-1]可以获取flags。此外，redis也专门针对这个问题做了专门的对齐优化。

sds和C字符串的区别

C语言使用长度为N+1的字符数组来表示长度为N的字符串，并且字符数组的最后一个字符总是为’\0’。

e.g： redis字符串用C语言的字符串表示为

‘R’

‘e’

‘d’

‘i’

’s’

‘\0’

, 数组长度为6。

1. 常数时间复杂度获取字符串长度

因为一个C字符串并不存储字符串的长度，程序必须遍历整个字符串，对遇到的每个字符进行计数，直到遇到’\0’才结束返回，时间复杂度为O(N)。

然而sds获取字符串长度的时间复杂度为O(1), 只需要通过访问sds结构的len属性即可。

Redis通过使用sds作为字符串键的底层实现，确保了获取字符串的长度的工作不会成为Redis的性能瓶颈。

2. 杜绝缓冲区溢出

因为C字符串不记录自身的长度，因此用户在使用的时候，在修改字符串的时候，很容易造成溢出，可能会导致程序直接报错，数组越界，或者是错误的修改了临近的内存区数据。

e.g：假设程序里有两个在内存中紧邻着的C字符串s1和s2，其中s1保存了字符串’Redis’, s2则保存了’MongoDB’。这时候假如一个程序员执行了strcat(s1, “ is the best”); ，将s1的内容修改为’Redis is the best’，但是却忘记在执行strcat前为s1分配足够的空间，那么执行完strcat函数之后，s1的数据将溢出到s2所在的内存空间中，从而导致s2的内容被意外修改，最终可能会导致s2中的’MongoDB’被覆盖写成’is the best’。

然而sds与C字符串不同，在每次需要修改sds的时候，会先进行sds的空间检查，如果空间不够，会先进行扩容处理，然后再执行实际的修改操作，所以使用sds的时候，既不需要手动修改sds的空间大小，也不会出现缓冲区溢出的问题。

3. 减少字符串修改所带来的内存重新分配次数

对于C字符串，数组进行一次内存重分配的操作：

如果执行的是append操作，程序需要先通过内存重分配来扩展底层数组的空间大小，如果忘了这一步就会产生缓冲区溢出；
如果执行的是trim操作，程序需要释放字符串中不再适用的空间，如果忘了这一步就会产生内存浪费甚至内存泄漏。

但是针对数据库的使用场景来看，修改的操作非常频繁，如果每一次修改都需执行一次内存重分配的话，会对性能造成很大的影响。

为了避免这个问题，sds通过未使用空间来解除字符串长度和底层数据长度之间的关联：在sds中，buf数组的长度不一定等于字符串长度加一。每次创建或者修改sds时，API都会为sds分配额外的未使用空间。

4. 二进制安全

C语言中的字符必须符合某种编码（如ASCII），并且除了字符串的末尾之外，字符串中不能包含空字符，否则最先被程序读入的空字符会被误认为是字符串的结束符，这些限制了C字符串只能保存文本数据，而不能保存如图片，音频，视频，压缩文件等二进制数据。

然而sds的buf数组，保存的是二进制数据，因为sds是通过len属性来判断字符串的结束，而不是像C字符串一样通过’\0’来判断，因此buf数组是可以存储包含了’\0’字符的字符串的。

5. 兼容部分C字符串的函数

因为typedef char *sds;把sds定义为了char *类型。

假如我们需要自行定义一个String，可能会这样写：

struct custom_sds {
		char *buf;                          //存储实际字符
    size_t len;                         //字符串的长度
}

如果需要打印custom_sds的内容，需要这样操作：

struct custom_sds *sds = newsds("Hello World");	//假设newsds完成内存分配并初始化buf为"Hello World"
printf("%s", sds->buf);

对于sds，如果我们要实现上面的功能，我们的代码是：

sds sds = sdsnew("Hello World");
printf("%s", sds);

之所以可以这样操作，是因为sds的结构是这样的：

+--------+-------------------------------+-----------+
| Header | Binary safe C alike string... | Null term |
+--------+-------------------------------+-----------+
         |
         `-> Pointer returned to the user.

sds的API sdsnew函数，返回的实际上是一个char *类型的指针，这个指针指向的地址为字符串的开始地址，这样带来的好处是：

兼容部分C字符串的函数，只要函数是用char *为参数，如：strcat， strcmp，而不需要通过结构体来获取地址以后再传递；
可以直接访问单个字符，如
```
printf("%c %c\n", sds[0], sds[1]);
```
如果想要通过custom_sds去访问，则需要每次获取buf的地址以后再访问，如custom_sds->buf[0], custom_sds->buf[1].
分配的空间总是连续的，对cache的命中率更加友好。原因是一次连续分配了Headers+String+Null，因此对于一个sds，它的每个部分的内存总是连续的；但是对于custom_sds，通常需要两次malloc操作，如下：
```
struct custom_sds *sds = (struct custom_sds *)malloc(sizeof(struct struct custom_sds));
sds->buf = (char *)malloc(sizeof(SIZE));
```
这两次分配并不能保证sds和buf的内存地址是连续的。

创建、扩容和销毁

1. 创建

sds s = sdsnew("Hello World");
printf("Length: %d, Type: %d\n", sdslen(s), sdsReqType(s));

Out>
Length: 12, Type: 0

先来看下创建的过程：

sdsnew

/* Create a new sds string starting from a null terminated C string. */
sds sdsnew(const char *init) {
    size_t initlen = (init == NULL) ? 0 : strlen(init);
    return sdsnewlen(init, initlen);
}

首先获取字符串的长度，然后调用sdsnewlen函数

sdsnewlen

/* Create a new sds string with the content specified by the 'init' pointer
 * and 'initlen'.
 * If NULL is used for 'init' the string is initialized with zero bytes.
 * If SDS_NOINIT is used, the buffer is left uninitialized;
 *
 * The string is always null-termined (all the sds strings are, always) so
 * even if you create an sds string with:
 *
 * mystring = sdsnewlen("abc",3);
 *
 * You can print the string with printf() as there is an implicit \0 at the
 * end of the string. However the string is binary safe and can contain
 * \0 characters in the middle, as the length is stored in the sds header. */
sds sdsnewlen(const void *init, size_t initlen) {
    void *sh;
    sds s;
    char type = sdsReqType(initlen);
    /* Empty strings are usually created in order to append. Use type 8
     * since type 5 is not good at this. */
    if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8;
    int hdrlen = sdsHdrSize(type);
    unsigned char *fp; /* flags pointer. */
    size_t usable;

    sh = s_malloc_usable(hdrlen+initlen+1, &usable);
    if (sh == NULL) return NULL;
    if (init==SDS_NOINIT)
        init = NULL;
    else if (!init)
        memset(sh, 0, hdrlen+initlen+1);
    s = (char*)sh+hdrlen;
    fp = ((unsigned char*)s)-1;
    usable = usable-hdrlen-1;
    if (usable > sdsTypeMaxSize(type))
        usable = sdsTypeMaxSize(type);
    switch(type) {
        case SDS_TYPE_5: {
            *fp = type | (initlen << SDS_TYPE_BITS);
            break;
        }
        case SDS_TYPE_8: {
            SDS_HDR_VAR(8,s);
            sh->len = initlen;
            sh->alloc = usable;
            *fp = type;
            break;
        }
        case SDS_TYPE_16: {
            SDS_HDR_VAR(16,s);
            sh->len = initlen;
            sh->alloc = usable;
            *fp = type;
            break;
        }
        case SDS_TYPE_32: {
            SDS_HDR_VAR(32,s);
            sh->len = initlen;
            sh->alloc = usable;
            *fp = type;
            break;
        }
        case SDS_TYPE_64: {
            SDS_HDR_VAR(64,s);
            sh->len = initlen;
            sh->alloc = usable;
            *fp = type;
            break;
        }
    }
    if (initlen && init)
        memcpy(s, init, initlen);
    s[initlen] = '\0';
    return s;
}

函数的基本流程如下：

1）通过字符串长度，获取对应的类型 char type = sdsReqType(initlen);，代码如下

static inline char sdsReqType(size_t string_size) {
    if (string_size < 1<<5)
        return SDS_TYPE_5;
    if (string_size < 1<<8)
        return SDS_TYPE_8;
    if (string_size < 1<<16)
        return SDS_TYPE_16;
#if (LONG_MAX == LLONG_MAX)
    if (string_size < 1ll<<32)
        return SDS_TYPE_32;
    return SDS_TYPE_64;
#else
    return SDS_TYPE_32;
#endif
}

注意这里使用了静态内联的方式去定义sdsReqType，好处是，因为获取字符串类型的操作非常频繁，通过静态内联函数可以减少内存的访问次数。

2）通过上一步获取到的type，获得Header的长度，int hdrlen = sdsHdrSize(type);，代码如下

static inline int sdsHdrSize(char type) {
    switch(type&SDS_TYPE_MASK) {
        case SDS_TYPE_5:
            return sizeof(struct sdshdr5);
        case SDS_TYPE_8:
            return sizeof(struct sdshdr8);
        case SDS_TYPE_16:
            return sizeof(struct sdshdr16);
        case SDS_TYPE_32:
            return sizeof(struct sdshdr32);
        case SDS_TYPE_64:
            return sizeof(struct sdshdr64);
    }
    return 0;
}

3）接下来通过s_malloc_usable申请了hdrlen+initlen+1的长度空间；表示头部+字符串+'\0'，然后让s指向了字符串开始的地址s = (char*)sh+hdrlen;，fp指向header头部的最后一个字节，fp = ((unsigned char*)s)-1;，也就是header结构体中的flag。

4）接下来进入switch，因为例子中Hello World长度为12，对应类型是SDS_TYPE_5，所以执行了*fp = type | (initlen << SDS_TYPE_BITS);；对于SDS_TYPE_5来说，长度不大于32，所以长度信息可以直接存在flag里面，flag长度为1个字节，8bit，高5位用于存储字符串的长度，低三位用来存储type；对于其他长度类型，会执行SDS_HDR_VAR函数，用于获取sh即sds结构体的首地址，该方法通过头文件sds.h中的宏实现，如下：

#define SDS_HDR_VAR(T,s) struct sdshdr##T *sh = (void*)((s)-(sizeof(struct sdshdr##T)));

e.g: SDS_HDR_VAR(8,s);展开后为语句struct sdshdr8 *sh = (void*)((s)-(sizeof(struct sdshdr8)));，因为上面提到s指向的是字符串的开始地址，通过s地址减去头部结构体的大小，就可以获得sds结构体的开始地址；然后给结构体中的成员赋值：

sh->len = initlen;
sh->alloc = usable;
*fp = type;

5）最后break出来，完成字符串的拷贝操作，给s的结尾置’\0’，s[initlen] = '\0';，返回s的地址，至此，sdsnew调用完毕，此时sds的结构图如下：

 1byte     11byte     1byte
+------+-------------+-----+
| flag | Hello World | \0  |
+------+-------------+-----+

2. 扩容

这里通过调用sdscat函数来演示扩容：

s = sdscat(s, "The length of this sentence is greater than 32 bytes");
printf("Length: %d, Type: %d\n", sdslen(s), sdsReqType(sdslen(s)));

Out>
Length: 64, Type: 1

可以看到，加上新加的字符串以后，长度超过32，sds自动扩容，并把类型转为1，即SDS_TYPE_8；这个在头文件sds.h中定义：

#define SDS_TYPE_5  0
#define SDS_TYPE_8  1
#define SDS_TYPE_16 2
#define SDS_TYPE_32 3
#define SDS_TYPE_64 4

看下sdscat函数的操作过程：

1) 调用sdscatlen函数

/* Append the specified null terminated C string to the sds string 's'.
 *
 * After the call, the passed sds string is no longer valid and all the
 * references must be substituted with the new pointer returned by the call. */
sds sdscat(sds s, const char *t) {
    return sdscatlen(s, t, strlen(t));
}

2)看一下sdscatlen函数的实现

/* Append the specified binary-safe string pointed by 't' of 'len' bytes to the
 * end of the specified sds string 's'.
 *
 * After the call, the passed sds string is no longer valid and all the
 * references must be substituted with the new pointer returned by the call. */
sds sdscatlen(sds s, const void *t, size_t len) {
    size_t curlen = sdslen(s);

    s = sdsMakeRoomFor(s,len);
    if (s == NULL) return NULL;
    memcpy(s+curlen, t, len);
    sdssetlen(s, curlen+len);
    s[curlen+len] = '\0';
    return s;
}

首先获取当前s的长度curlen，接着调用了sdsMakeRoomFor这个函数，很关键，它能保证s的空间足够，如果不够则会进行动态分配，代码如下：

/* Enlarge the free space at the end of the sds string so that the caller
 * is sure that after calling this function can overwrite up to addlen
 * bytes after the end of the string, plus one more byte for nul term.
 *
 * Note: this does not change the *length* of the sds string as returned
 * by sdslen(), but only the free buffer space we have. */
sds sdsMakeRoomFor(sds s, size_t addlen) {
    void *sh, *newsh;
    size_t avail = sdsavail(s);
    size_t len, newlen;
    char type, oldtype = s[-1] & SDS_TYPE_MASK;
    int hdrlen;
    size_t usable;

    /* Return ASAP if there is enough space left. */
    if (avail >= addlen) return s;

    len = sdslen(s);
    sh = (char*)s-sdsHdrSize(oldtype);
    newlen = (len+addlen);
    if (newlen < SDS_MAX_PREALLOC)
        newlen *= 2;
    else
        newlen += SDS_MAX_PREALLOC;

    type = sdsReqType(newlen);

    /* Don't use type 5: the user is appending to the string and type 5 is
     * not able to remember empty space, so sdsMakeRoomFor() must be called
     * at every appending operation. */
    if (type == SDS_TYPE_5) type = SDS_TYPE_8;

    hdrlen = sdsHdrSize(type);
    if (oldtype==type) {
        newsh = s_realloc_usable(sh, hdrlen+newlen+1, &usable);
        if (newsh == NULL) return NULL;
        s = (char*)newsh+hdrlen;
    } else {
        /* Since the header size changes, need to move the string forward,
         * and can't use realloc */
        newsh = s_malloc_usable(hdrlen+newlen+1, &usable);
        if (newsh == NULL) return NULL;
        memcpy((char*)newsh+hdrlen, s, len+1);
        s_free(sh);
        s = (char*)newsh+hdrlen;
        s[-1] = type;
        sdssetlen(s, len);
    }
    usable = usable-hdrlen-1;
    if (usable > sdsTypeMaxSize(type))
        usable = sdsTypeMaxSize(type);
    sdssetalloc(s, usable);
    return s;
}

首先通过size_t avail = sdsavail(s);，获取当前s的可用空间，sdsavail函数如下：

static inline size_t sdsavail(const sds s) {
    unsigned char flags = s[-1];
    switch(flags&SDS_TYPE_MASK) {
        case SDS_TYPE_5: {
            return 0;
        }
        case SDS_TYPE_8: {
            SDS_HDR_VAR(8,s);
            return sh->alloc - sh->len;
        }
        case SDS_TYPE_16: {
            SDS_HDR_VAR(16,s);
            return sh->alloc - sh->len;
        }
        case SDS_TYPE_32: {
            SDS_HDR_VAR(32,s);
            return sh->alloc - sh->len;
        }
        case SDS_TYPE_64: {
            SDS_HDR_VAR(64,s);
            return sh->alloc - sh->len;
        }
    }
    return 0;
}

对于SDS_TYPE_5直接返回0，对于其他类型，通过Header中的alloc和len相减sh->alloc - sh->len获取；

if (avail >= addlen) return s;接着判断大小，如果可用空间足够，直接返回s；
否则，
```
len = sdslen(s);
sh = (char*)s-sdsHdrSize(oldtype);
newlen = (len+addlen);
if (newlen < SDS_MAX_PREALLOC)
   newlen *= 2;
else
   newlen += SDS_MAX_PREALLOC;
  
type = sdsReqType(newlen);
```
获取当前长度，加上新增的长度，如果新的长度没有超过最大长度SDS_MAX_PREALLOC=1024*1024，则直接给新长度*2，这样动态扩容可以避免频繁的malloc操作；type = sdsReqType(newlen);获取新长度对应的sds类型；
然后判断新长度对应的类型是否为SDS_TYPE_5，如果是的话，把类型置为SDS_TYPE_8，因为SDS_TYPE_5不记录空闲的空间大小，这样每次操作都会调用一次sdsMakeRoomFor函数，非常影响效率；

接下来判断类型是否发生了变化，来决定是直接扩充空间，还是重新申请空间;

类型未发生变化，直击扩充空间：

newsh = s_realloc_usable(sh, hdrlen+newlen+1, &usable);
if (newsh == NULL) return NULL;
s = (char*)newsh+hdrlen;

我们例子这里未发生了变化，从SDS_TYPE_5变成了SDS_TYPE_8，需要重新申请空间：

/* Since the header size changes, need to move the string forward,
* and can't use realloc */
newsh = s_malloc_usable(hdrlen+newlen+1, &usable);		//重新申请空间
if (newsh == NULL) return NULL;		
memcpy((char*)newsh+hdrlen, s, len+1);		//将String拷贝到新的String部分
s_free(sh);																//将旧的sds全部释放
s = (char*)newsh+hdrlen;									//重新将s指向新的sds的字符串首地址
s[-1] = type;															//设置type
sdssetlen(s, len);												//设置大小

最后，

usable = usable-hdrlen-1;						//获取可用空间，即alloc
if (usable > sdsTypeMaxSize(type))	//如果超过最大可用空间，则置为sdsTypeMaxSize
   usable = sdsTypeMaxSize(type);
sdssetalloc(s, usable);							//设置结构体成员alloc的值
return s;														//返回新的s地址

回到sdscatlen函数，接下来把需要添加的字符串拷贝至新的空间，然后设置长度和最后的’\0’

memcpy(s+curlen, t, len);
sdssetlen(s, curlen+len);
s[curlen+len] = '\0';
return s;

至此，s变成了这个样子：

 3byte            128byte             1byte
+----+-----+----+--------------------+-----+
|len |alloc|flag| Hello World The ...| \0  |
+----+-----+----+--------------------+-----+

注意下，代码输出的长度64是字符串的长度，即len，128对应的是alloc，已分配的大小，此时如果需要再追加小于64字节的内容，就无需再重复扩容的操作了。

3. 销毁

这个过程很简单，直接调用sdsfree函数，

/* Free an sds string. No operation is performed if 's' is NULL. */
void sdsfree(sds s) {
    if (s == NULL) return;
    s_free((char*)s-sdsHdrSize(s[-1]));
}

如果为空，直接返回；否则获取header的首地址然后释放，通过sdsHdrSize(s[-1])直接获取header的长度，用s减去header的长度就可以获取到结构体的首地址了。