Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in Load Testing with MySQL in Firecracker Environment - Buffer Size Issue #1999

Open
Roopsai507 opened this issue Feb 9, 2024 · 11 comments · Fixed by #2002
Open

Error in Load Testing with MySQL in Firecracker Environment - Buffer Size Issue #1999

Roopsai507 opened this issue Feb 9, 2024 · 11 comments · Fixed by #2002
Assignees

Comments

@Roopsai507
Copy link

Roopsai507 commented Feb 9, 2024

I'm facing an issue while conducting load test with HammerDB on MySQL within a Firecracker environment.
I'm encountering error messages related to buffer sizes and frame handling. Below is a snippet of the error messages:

2024-02-09T11:12:26.690064527 [9090:main] Receiving buffer is too small to hold frame of current size
2024-02-09T11:12:26.690120360 [9090:main] Receiving buffer is too small to hold frame of current size
2024-02-09T11:12:26.690126586 [9090:main] Receiving buffer is too small to hold frame of current size

frame trace:
ffffc00001e07f50:   ffffffff8005b624
ffffc00001e07fc0:   ffffffff800472ba

loaded klibs:
assertion len <= x->p.pbuf.len failed at /home/circleci/project/src/virtio/virtio_net.c:196 (IP 0xffffffff80106807)  in vnet_input(); halt

frame trace:
ffffc00001d1b060:   ffffffff800467d2
ffffc00001d1b0c0:   ffffffff80047236
ffffc00001d1b3c0:   ffffffff80044168
ffffc00001d1b3d0:   ffffffff800467d2
ffffc00001d1b430:   ffffffff80047236
ffffc00001d1b730:   ffffffff80044168
ffffc00001d1b740:   ffffffff800467d2
ffffc00001d1b7a0:   ffffffff80047236
ffffc00001d1baa0:   ffffffff80044168
ffffc00001d1bab0:   ffffffff800467d2
ffffc00001d1bb10:   ffffffff80047236
ffffc00001d1be10:   ffffffff80044168
ffffc00001d1be20:   ffffffff800467d2
ffffc00001d1be80:   ffffffff80047236
ffffc00001d1c180:   ffffffff80044168
ffffc00001d1c190:   ffffffff800467d2
ffffc00001d1c1f0:   ffffffff80047236
ffffc00001d1c4f0:   ffffffff80044168
ffffc00001d1c500:   ffffffff800467d2
ffffc00001d1c560:   ffffffff80047236
ffffc00001d1c860:   ffffffff80044168
ffffc00001d1c870:   ffffffff800467d2
ffffc00001d1c8d0:   ffffffff80047236
ffffc00001d1cbd0:   ffffffff80044168
ffffc00001d1cbe0:   ffffffff800467d2
ffffc00001d1cc40:   ffffffff80047236
ffffc00001d1cf40:   ffffffff80044168
ffffc00001d1cf50:   ffffffff800467d2
ffffc00001d1cfb0:   ffffffff80047236
ffffc00001d1d2b0:   ffffffff80044168
ffffc00001d1d2c0:   ffffffff800467d2
ffffc00001d1d320:   ffffffff80047236

loaded klibs:
assertion buffer_extend(v, sizeof(void *)) failed at /home/circleci/project/src/runtime/vector.h:73 (IP 0xffffffff80044130)  in vector_push(); halt
validate_page error: objcache 0xffff80000005ad38, footer 0xffffc00001dfffe0, bad magic! (10)
objcache_allocate error: alloc failed

Test Configurations:

Nanos Kernel - ~/.ops/d875bfe/kernel.img
Ops version: 0.1.40
Nanos version: 0.1.49
MySQL: 8.0.35
Firecracker: v1.6.0
HammerDB: 4.7

HammerDB Testing Setup:

./hammerdbcli
hammerdb>cat buildSchema.tcl
puts "SETTING CONFIGURATION"
dbset db mysql
diset connection mysql_host 11.24.45.251
diset connection mysql_port 3306
diset tpcc mysql_pass root
diset tpcc mysql_count_ware 800
diset tpcc mysql_partition true
diset tpcc mysql_num_vu 64
diset tpcc mysql_storage_engine innodb
print dict
buildschema
hammerdb>source buildSchema.tcl
@francescolavra
Copy link
Member

The issue causing the above error messages has been fixed in #2002. To test this fix, you can create a new image with Ops by specifying --nanos-version 659f598 in your ops image create command, and then use ~/.ops/659f598/kernel.img as kernel image in your Firecracker configuration.

@Roopsai507
Copy link
Author

The same I have tried with postgres(Image - francescolavra/postgres:16.0) with the updated kernel image ~/.ops/659f598/kernel.img

Got the below memory error and kernel got crashed.

2024-03-19 13:17:57.245 UTC [4361762560] WARNING:  plancache reference leak: plan 0x10c2e8658 not closed
2024-03-19 13:17:57.255 UTC [4351211264] ERROR:  lock reference 0x1042e89b8 is not owned by resource owner (null)
2024-03-19 13:17:57.257 UTC [4351211264] WARNING:  AbortTransaction while in COMMIT state
2024-03-19 13:17:57.258 UTC [4351211264] PANIC:  cannot abort transaction 4883, it was already committed

*** signal 6 received by tid 15, errno 0, code -6

*** Thread context:
lastvector: 00000000000000ea
     frame: ffffc0000261b800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000

@eyberg eyberg reopened this Mar 19, 2024
@francescolavra
Copy link
Member

The above log does not show a kernel crash, it's the postgres program that triggered an unhandled SEGV signal and therefore has been terminated by the kernel. Anyway, I tried load-testing postgres with HammerDB but I couldn't reproduce this issue.
I'm using Firecracker v1.6.0, and tried with both kernel build 659f598 and the latest nightly build.
@Roopsai507 could you share the tcl file you are using to build the postgres schema in HammerDB? Mine is as follows:

puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host 11.0.2.15
print dict
buildschema

@Roopsai507
Copy link
Author

Roopsai507 commented Mar 23, 2024

First run the below build schema tcl file

> cat buildSchema.tcl
puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host 11.0.2.15
diset connection pg_port 9090
diset tpcc pg_count_ware 5
diset tpcc pg_superuser postgres
diset tpcc pg_num_vu 5
diset tpcc pg_user tpcc
diset tpcc pg_pass tpcc
diset tpcc pg_dbase tpcc
print dict
buildschema

After this run the below tcl file the issue will be reproduced

> cat pgrun.tcl          
#!/bin/tclsh
proc runtimer { seconds } {
set x 0
set timerstop 0
while {!$timerstop} {
incr x
after 1000
if { ![ expr {$x % 60} ] } {
set y [ expr $x / 60 ]
puts "Timer: $y minutes elapsed"
}
update
if { [ vucomplete ] || $x eq $seconds } { set timerstop 1 }
}
return
}

puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host 11.0.2.15
diset connection pg_port 9090
diset tpcc pg_superuser postgres
diset tpcc pg_user tpcc
diset tpcc pg_pass tpcc
diset tpcc pg_dbase tpcc
diset tpcc pg_driver timed
diset tpcc pg_duration 2
diset tpcc pg_duration 5
diset tpcc pg_vacuum true
print dict
vuset logtotemp 1
loadscript
puts "SEQUENCE STARTED"
foreach z {1} {
puts "$z VU test"
vuset vu $z
vucreate
vurun
runtimer 480
vudestroy
after 5000
}
puts "TEST SEQUENCE COMPLETE"

@francescolavra
Copy link
Member

francescolavra commented Apr 17, 2024

I found some issues in both the postgres:16.0 package and the Nanos kernel. The postgres issues have been fixed in a new package I uploaded (francescolavra/postgres:16.2), while the kernel issues have been fixed in 95d807a. With these fixes, the HammerDB load test completes successfully.

@Roopsai507
Copy link
Author

I tried testing it with the nanos-kernel version 95d807a
ops image create -c config.json --imagename postgres --package francescolavra/postgres_16.2 --nanos-version 95d807a

But I couldn't get it to start. When I tried starting the Firecracker VM with the kernel image 95d807a, I encountered a segmentation error. It seems there's an issue with the kernel image. I've tested with other kernel versions, and PostgreSQL starts fine with those.

2024-04-23T08:37:39.278245165 [9090:main] Running Firecracker v1.6.0
2024-04-23T08:37:39.376741397 [9090:main] Artificially kick devices.
2024-04-23T08:37:39.376852171 [9090:main] Successfully started microvm that was configured from one single json
warning: ACPI MADT not found, default to 1 processor
en1: assigned 11.244.15.161

*** signal 11 received by tid 2, errno 0, code 2
    fault address 0xfffd7db8

*** Thread context:
lastvector: 000000000000000e (Page fault)
     frame: ffffc00002601800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000
error code: 0000000000000007
   address: 00000000fffd7db8

@francescolavra
Copy link
Member

I'm unable to reproduce the issue here. Does PostrgreSQL start fine if you run it with Ops (i.e. with ops pkg load francescolavra/postgres_16.2 -c config.json --imagename postgres --nanos-version 95d807a)? If you get the same error, could you share the contents of the config.json file you are using in the Ops command line?

@Roopsai507
Copy link
Author

Roopsai507 commented Apr 23, 2024

The command ops pkg load works fine with the given configuration.

However, when I used the same generated image (postgres) and kernel image (95d807a) with Firecracker, I encountered the error mentioned earlier.
If I replace the kernel image with any other version, such as 659f598 or d875bfe, it works without issues(PostrgreSQL starts fine).

cat config.json 
{
    "RunConfig": {
        "IPAddress": "11.244.15.161",
        "NetMask": "255.255.255.252",
        "Gateway": "11.244.15.162",
	"BridgeName": "eth0"
    },
    "BaseVolumeSz": "20000m"
}

Command for image creation :
ops image create -c config.json --imagename postgress --package francescolavra/postgres_16.2

Kernel Given to Firecracker: ~/.ops/95d807a/kernel.img

@francescolavra
Copy link
Member

It still works fine here. My Firecracker config file is as follows:

$ cat ../vm_config.json 
{
  "boot-source": {
    "kernel_image_path": "/home/francesco/.ops/95d807a/kernel.img"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/home/francesco/.ops/images/postgress",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "network-interfaces": [
    {
      "iface_id": "eth0",
      "guest_mac": "AA:FC:00:00:00:01",
      "host_dev_name": "tap1"
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 2048
  }
}

Then I run Firecracker as below:

$ /opt/firecracker-v1.6.0-x86_64 --no-api --config-file ../vm_config.json
2024-04-23T16:53:06.399909931 [anonymous-instance:main] Running Firecracker v1.6.0
2024-04-23T16:53:06.420688187 [anonymous-instance:main] Artificially kick devices.
2024-04-23T16:53:06.421089364 [anonymous-instance:main] Successfully started microvm that was configured from one single json
warning: ACPI MADT not found, default to 1 processor
en1: assigned 11.244.15.161
2024-04-23 14:53:06.548 UTC [4296224768] LOG:  starting PostgreSQL 16.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 8.4.0-1ubuntu1~16.04.1) 8.4.0, 64-bit
2024-04-23 14:53:06.566 UTC [4296224768] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-04-23 14:53:06.574 UTC [4296224768] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-04-23 14:53:06.590 UTC [4311586560] LOG:  database system was shut down at 2024-03-27 16:26:45 UTC
2024-04-23 14:53:06.604 UTC [4296224768] LOG:  database system is ready to accept connections
en1: assigned FE80::A8FC:FF:FE00:1

What Firecracker command line are you using? And what are the contents of your Firecracker config file?

@Roopsai507
Copy link
Author

Roopsai507 commented Apr 23, 2024

sudo firecracker-v1.6.0-x86_64 --no-api --config-file /home/roop/fcconfig.json
2024-04-23T20:58:50.543184053 [anonymous-instance:main] Running Firecracker v1.6.0
2024-04-23T20:58:50.624887552 [anonymous-instance:main] Artificially kick devices.
2024-04-23T20:58:50.625226239 [anonymous-instance:main] Successfully started microvm that was configured from one single json
warning: ACPI MADT not found, default to 1 processor
en1: assigned 11.244.15.161

*** signal 11 received by tid 2, errno 0, code 2
    fault address 0xffde5db8

*** Thread context:
lastvector: 000000000000000e (Page fault)
     frame: ffffc00002601800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000
error code: 0000000000000007
   address: 00000000ffde5db8

   rax: 0000000000000000
   rbx: 0000000000000000
   rcx: 0000155a75c9cc30

FC Command : sudo firecracker-v1.6.0-x86_64 --no-api --config-file /home/roop/fcconfig.json

ConfigFile :

cat fcconfig.json 
{
  "boot-source": {
    "kernel_image_path": "/home/roop/kernel_95d807a.img"
  },
  "machine-config": {
    "vcpu_count": 4,
    "mem_size_mib": 13404
  },
  "network-interfaces": [
    {
      "iface_id": "bond0",
      "host_dev_name": "fc-1000-tap0"
    }
  ],
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/home/roop/images/postgress",
      "is_root_device": true,
      "is_read_only": false
    }

 ]
}


I think I found the issue ,
have changed the machine-config

{
    "vcpu_count": 4,
    "mem_size_mib": 13404
  }

to your given conf and working fine

{
    "vcpu_count": 1,
    "mem_size_mib": 2048
  }

May I know why it is causing the issue

francescolavra added a commit that referenced this issue Apr 24, 2024
Depending on the address of the initial pages allocated in
setup_initmap(), the map_setup_2mbpages() function may not find an
existing PDPT and or PDT in the page tables, and thus may need to
use a new PDPT and/or a new PDT. However, the PDPT and PDT
addresses passed to this function correspond to in-use pages, and
as such cannot be reused and assigned to new PTEs. Moreover, when
creating a new PTE, the map_setup_2mbpages() function is simply
OR-ing the page address with the page flags, failing to set the
USER flag and thereby preventing any pages referenced by this
directory entry from being mapped for user space access; this may
cause bogus segmentation fault signals to be delivered to the user
process (#1999 (comment)).

This change fixes the above issues by amending map_setup_2mbpages()
so that it takes a physical memory region from which to allocate
any new pages it may need, and calls new_level_pte() when creating
a new PTE.
@francescolavra
Copy link
Member

OK, thanks, you uncovered another bug in the kernel. This is fixed in #2019, and you can retrieve the fixed kernel build with --nanos-version d9743ea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants