Remote RAM disk with RDMA
In this post I’ll show you how to use iSER, iSCSI, and LIO to set up a remote RAM disk. This is useful if you need high IOPS but don’t have access to a bunch of SSDs or NVRAM. Note that the performance achieved in this post is quite low compared to what you should be able to achieve with different hardware. Currently the arm64 machines we are using aren’t getting the performance we expected, and tuning is ongoing. However, the steps described here are relevant for other installations. Once you create several remote RAM disks, you can tie them together with RAID-0 or dm-linear, as sketched below.
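For example, once two remote RAM disks show up on the client as local block devices, they could be striped together with mdadm (a minimal sketch; the device names /dev/sdb and /dev/sdc are assumptions and will differ on your machine):
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc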
We’ll use the following hardware provided by CloudLab.
- HP Moonshot m400
- Eight 64-bit ARMv8 (Atlas/A57) cores at 2.4 GHz (APM X-GENE)
- 64GB ECC Memory (8x 8 GB DDR3-1600 SO-DIMMs)
- 120 GB of flash (SATA3 / M.2, Micron M500)
- Dual-port Mellanox ConnectX-3 10 GbE NIC (PCIe v3.0, 8 lanes)
Next I’ll show you the basic server and client setup, and then demonstrate usage with some basic benchmarks.
Target (Server) Setup #
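All of the target-side commands below are entered at the /> prompt of the targetcli shell, which you can start on the target node (assuming the LIO userspace tools are installed) with:
sudo targetcli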
Make the RAMDisk backing store:
/> /backstores/rd_mcp create name=rd1 size=50G
Generating a wwn serial.
Created rd_mcp ramdisk rd1 with size 50G.
Make the iSCSI target:
/> /iscsi create
Created target iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040.
Selected TPG Tag 1.
Successfully created TPG 1.
Create a LUN backed by the RAMDisk:
/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/luns create storage_object=/backstores/rd_mcp/rd1
Selected LUN 0.
Successfully created LUN 0.
Create a portal for the iSCSI target:
/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/portals create 10.10.1.3
Using default IP port 3260
Successfully created network portal 10.10.1.3:3260.
Enable iSER on the portal:
/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/portals/10.10.1.3:3260 iser_enable
iser operation has been enabled
Here is the final configuration:
/> ls
o- / ..................................................................... [...]
o- backstores .......................................................... [...]
| o- fileio ............................................... [0 Storage Object]
| o- iblock ............................................... [0 Storage Object]
| o- pscsi ................................................ [0 Storage Object]
| o- rd_dr ................................................ [0 Storage Object]
| o- rd_mcp ............................................... [1 Storage Object]
| o- rd1 ............................................... [ramdisk activated]
o- ib_srpt ....................................................... [0 Targets]
o- iscsi .......................................................... [1 Target]
| o- iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040 ...... [1 TPG]
| o- tpgt1 ....................................................... [enabled]
| o- acls ....................................................... [0 ACLs]
| o- luns ........................................................ [1 LUN]
| | o- lun0 ....................................... [rd_mcp/rd1 (ramdisk)]
| o- portals .................................................. [1 Portal]
| o- 10.10.1.3:3260 ................................. [OK, iser enabled]
o- loopback ...................................................... [0 Targets]
o- qla2xxx ....................................................... [0 Targets]
o- tcm_fc ........................................................ [0 Targets]
Finally, disable all authentication and access controls so any initiator can connect (fine for this experiment, but not something you’d do on a shared network):
/> /iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
Parameter authentication is now '0'.
Parameter demo_mode_write_protect is now '0'.
Parameter generate_node_acls is now '1'.
Parameter cache_dynamic_acls is now '1'.
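If you want this configuration to survive a restart of the target, it can be saved from the same prompt (a sketch; how the saved configuration gets reloaded at boot depends on your distribution and targetcli version):
/> saveconfig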
Initiator (Client) Setup #
From the client, also called the initiator, we can use the iscsiadm tool to look for the targets we have created. In this case we’ve set up one iSCSI target on the node with address 10.10.1.3:
nwatkins@node-0:~$ sudo iscsiadm -m discovery -t sendtargets -p 10.10.1.3
10.10.1.3:3260,1 iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde
To access the iSCSI targets as local devices we need to log in to the targets. We can log in to all of the targets that have been discovered with the following command:
nwatkins@node-0:~$ sudo iscsiadm -m node -L all
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] (multiple)
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] successful.
Now that we have logged in to the target, we should be able to access the LUN we set up as a local device. We can see that the device has been attached by examining dmesg:
[ 1653.685547] scsi2 : iSCSI Initiator over TCP/IP
[ 1653.941126] scsi 2:0:0:0: Direct-Access LIO-ORG RAMDISK-MCP 4.0 PQ: 0 ANSI: 5
[ 1653.941314] sd 2:0:0:0: Attached scsi generic sg1 type 0
[ 1653.942324] sd 2:0:0:0: [sdb] 104857600 512-byte logical blocks: (53.6 GB/50.0 GiB)
[ 1653.942717] sd 2:0:0:0: [sdb] Write Protect is off
[ 1653.942721] sd 2:0:0:0: [sdb] Mode Sense: 43 00 00 08
[ 1653.942880] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 1653.944324] sdb: unknown partition table
[ 1653.945174] sd 2:0:0:0: [sdb] Attached SCSI disk
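At this point the remote RAM disk behaves like any local block device. As a quick sanity check you can write a few blocks to it directly (assuming it came up as /dev/sdb, as in the dmesg output above):
sudo dd if=/dev/zero of=/dev/sdb bs=4k count=1024 oflag=direct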
The iscsiadm tool also lets us look at a lot of information about the targets. Using the following command we can see some of the networking configuration for the targets:
nwatkins@node-0:~$ sudo iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-873
Target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde
Current Portal: 10.10.1.3:3260,1
Persistent Portal: 10.10.1.3:3260,1
**********
Interface:
**********
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.1993-08.org.debian:01:a41f7afa2fc8
Iface IPaddress: 10.10.1.1
...
Notice that the Iface Transport option is set to tcp. In order to get maximum performance using RDMA, we want to use the iSER transport instead. To set this we first need to log out of the targets:
nwatkins@node-0:~$ sudo iscsiadm -m node -U all
Logging out of session [sid: 1, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260]
Logout of [sid: 1, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] successful.
Next, set the iface.transport_name option to iser for our target:
nwatkins@node-0:~$ sudo iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde -o update -n iface.transport_name -v iser
Now we can log back in to the target and check to be sure that the transport has been set to iSER. Login:
nwatkins@node-0:~$ sudo iscsiadm -m node -L all
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] (multiple)
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde, portal: 10.10.1.3,3260] successful.
Check transport:
nwatkins@node-0:~$ sudo iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-870
version 2.0-873
Target: iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde
Current Portal: 10.10.1.3:3260,1
Persistent Portal: 10.10.1.3:3260,1
**********
Interface:
**********
Iface Name: default
Iface Transport: iser
Iface Initiatorname: iqn.1993-08.org.debian:01:a41f7afa2fc8
...
Success!
Benchmarks #
We are going to use the fio tool to do 512-byte direct I/O random reads and writes to the remote RAM disk devices. With a single device we are able to get around 70,000 random read and write IOPS. Here is the write workload:
nwatkins@node-0:~$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb
asdf: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 1 process
^Cbs: 1 (f=1): [w] [0.0% done] [0KB/34454KB/0KB /s] [0/68.1K/0 iops] [eta 115d:17h:45m:24s]
And the read workload:
nwatkins@node-0:~$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb
asdf: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 1 process
^Cbs: 1 (f=1): [r] [0.0% done] [35481KB/0KB/0KB /s] [70.1K/0/0 iops] [eta 115d:17h:45m:53s]
I would expect the performance to be better.
One iSCSI Target with 2 LUNs #
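The second RAM disk and LUN are added under the existing target by repeating the earlier backstore and LUN steps (a sketch; the name rd2 is an assumption, and the IQN is whatever was generated on your target):
/> /backstores/rd_mcp create name=rd2 size=50G
/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/luns create storage_object=/backstores/rd_mcp/rd2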
Next we try again with one iSCSI target hosting two LUNs, and instruct fio to send I/Os to both devices. For writes we get slightly less than 2x speed-up at 113,000 IOPS:
nwatkins@node-0:~$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [ww] [0.0% done] [0KB/56665KB/0KB /s] [0/113K/0 iops] [eta 115d:17h:46m:05s]
And right about 2x speed-up for reads:
nwatkins@node-0:~$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [rr] [0.0% done] [60537KB/0KB/0KB /s] [121K/0/0 iops] [eta 115d:17h:46m:25s]
This still seems really slow.
Two iSCSI Portals with 2 LUNs #
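A second iSER-enabled portal can be added with the same portal commands used earlier (a rough sketch; the address 10.10.2.3 is an assumption for the NIC’s second port, and how LUNs end up split across portals depends on how you arrange targets and TPGs):
/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/portals create 10.10.2.3
/> iscsi/iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.385e783e5040/tpgt1/portals/10.10.2.3:3260 iser_enable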
Next we create separate portals and associate one LUN with each portal. Writes now perform a bit better than 2x at 131K IOPS, but this is probably a peak we caught; for the most part it’s about 2x:
nwatkins@node-0:~$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [ww] [0.0% done] [0KB/65281KB/0KB /s] [0/131K/0 iops] [eta 115d:17h:45m:36s]
And reads are a bit better too at 126K IOPS:
nwatkins@node-0:~$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc
sdb: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 2 processes
^Cbs: 2 (f=3): [rr] [0.0% done] [63220KB/0KB/0KB /s] [126K/0/0 iops] [eta 115d:17h:44m:26s]
Still not that great.
Four Targets with Four LUNs #
The next experiment is four targets, each with a separate LUN. Now we get roughly 200K IOPS for both the read and write workloads.
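Each extra target repeats the earlier target setup: a new rd_mcp backstore, a new iSCSI target, a LUN, and an iSER-enabled portal. Here is a sketch for the second target (rd2 is an assumption, and <generated-iqn> stands for whatever IQN targetcli generates); the authentication attributes also need to be relaxed on each new target, as before.
/> /backstores/rd_mcp create name=rd2 size=50G
/> /iscsi create
/> iscsi/<generated-iqn>/tpgt1/luns create storage_object=/backstores/rd_mcp/rd2
/> iscsi/<generated-iqn>/tpgt1/portals create 10.10.1.3
/> iscsi/<generated-iqn>/tpgt1/portals/10.10.1.3:3260 iser_enable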
Write workload:
nwatkins@node-0:~$ sudo fio --rw=randwrite --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc --filename=/dev/sdd --name=sdd --filename=/dev/sde --name=sde
sdb: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdd: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sde: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 4 processes
^Cbs: 4 (f=7): [wwww] [0.0% done] [0KB/98105KB/0KB /s] [0/196K/0 iops] [eta 115d:17h:44m:53s]
And the read workload:
nwatkins@node-0:~$ sudo fio --rw=randread --bs=512 --numjobs=1 --iodepth=128 --runtime=9999999 --time_based --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --exitall --filename=/dev/sdb --name=sdb --filename=/dev/sdc --name=sdc --filename=/dev/sdd --name=sdd --filename=/dev/sde --name=sde
sdb: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdc: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sdd: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
sde: (g=0): rw=randread, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=128
fio-2.1.3
Starting 4 processes
^Cbs: 4 (f=7): [rrrr] [0.0% done] [98.11MB/0KB/0KB /s] [201K/0/0 iops] [eta 115d:17h:46m:13s]
I would expect much higher IOPS than this.
Optimizations #
According to this page https://vanity-mellanoxexternal.jiveon.com/docs/DOC-1483 there are a lot of different optimizations that can be applied to help squeeze out more performance. However, after applying most of them the performance doesn’t really improve for me. While the setup on that page isn’t using RoCE and runs on x86 rather than arm64, it reaches almost 2 million IOPS. I’m hoping I can figure out how to get more IOPS out of our setup.
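As one concrete example of the kind of knob involved, the open-iscsi node settings that control queueing can be raised before logging in (an illustration only; the values here are arbitrary and a logout/login is needed for them to take effect):
sudo iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde -o update -n node.session.queue_depth -v 128
sudo iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.node-1.aarch64:sn.e2d0d3a3bfde -o update -n node.session.cmds_max -v 1024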