Google Summer of Code 2026
In 2026, we are again participating in the Google Summer of Code. Together with the MariaDB Foundation, we believe we are making a better database that remains application-compatible with MySQL. We also work on LGPL connectors (currently C, C++, ODBC, Java, Node.js) and on MariaDB Galera Cluster, which allows you to scale your reads and writes. And we have MariaDB ColumnStore, a columnar storage engine designed to process petabytes of data with real-time response to analytical queries.
Where to Start
Please join us on Zulip to mingle with the community. You should also subscribe to the developers mailing list (this is the main list where we discuss development - there are also other mailing lists).
To improve your chances of being accepted, it is a good idea to submit a pull request with a bug fix to the server.
Also see the List of beginner friendly issues from the MariaDB Issue Tracker.
List of Tasks (work in progress, check again tomorrow)
MariaDB Server
MDEV-28395 LOAD DATA plugins
Full-time project 350h
LOAD DATA INFILE can flexibly load data into a table from CSV-like files accessible by the mariadbd process. LOAD XML INFILE can do it for XML files. LOAD DATA LOCAL INFILE and LOAD XML LOCAL INFILE can do it with files accessible by the client, but not by the server. But there are requests to support loading more file formats and from other locations, for example, from S3.
This project is to implement support for LOAD plugins and refactor the current LOAD code accordingly. There are two kinds of plugins: data parser plugins (CSV-like and XML) and transfer plugins (file and LOCAL). Implementing new plugins is not in the scope of this task. The task is mainly about moving existing code around and making new plugins (like JSON or S3) possible.
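To make the split between the two plugin kinds concrete, here is a minimal Python sketch of how a generic LOAD could combine one transfer plugin with one parser plugin. It is only an illustration: the names `Transfer`, `Parser`, `memory_transfer`, `csv_parser` and `load_data` are hypothetical, and the real plugin API (in C++) is exactly what this project would design.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

# Hypothetical plugin shapes, for illustration only.
Transfer = Callable[[str], io.TextIOBase]                 # location -> text stream
Parser = Callable[[io.TextIOBase], Iterator[List[str]]]   # stream -> rows


def memory_transfer(files: Dict[str, str]) -> Transfer:
    """Stand-in for a transfer plugin: serves file content from a dict."""
    def open_source(location: str) -> io.TextIOBase:
        return io.StringIO(files[location])
    return open_source


def csv_parser(stream: io.TextIOBase) -> Iterator[List[str]]:
    """Stand-in for the CSV-like data parser plugin."""
    yield from csv.reader(stream)


def load_data(location: str, transfer: Transfer, parser: Parser) -> List[List[str]]:
    """Generic LOAD: the transfer plugin fetches, the parser plugin decodes."""
    return list(parser(transfer(location)))


files = {"t1.csv": "1,apple\n2,pear\n"}
rows = load_data("t1.csv", memory_transfer(files), csv_parser)
print(rows)
```

With this shape, a JSON parser or an S3 transfer would only need to supply another function of the same type, and `load_data` would stay unchanged.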
Skills needed: C++, bison Mentors: Sergei Golubchik
MDEV-7924 START SLAVE UNTIL to Support Timestamps
Part-time project 175h
Users can control the point at which a replica will automatically stop applying events from the primary via START REPLICA UNTIL. Currently, this only allows GTIDs and log offsets (file name, binlog offset), which usually requires users to manually examine the binary log on the master to find the exact transaction to stop at. Often, users do not care about the exact transaction; rather, the goal is to create a known, consistent state, e.g. across multiple different slaves (possibly of different masters). That is, users currently have to use mariadb-binlog to manually analyze the binary log files in the master server's binlog directory, find the transaction identifier (GTID or log offset) of some transaction at the timestamp they want to stop the slaves at, and input that into their START SLAVE UNTIL statement. With multiple masters, this process would need to be repeated for each master.
This project is to implement support for timestamps in START SLAVE UNTIL, to simplify the aforementioned process. To define inclusive/exclusive behaviors, it would be good to be consistent with the existing GTID-based keywords SQL_BEFORE_GTIDS and SQL_AFTER_GTIDS, i.e. to define SQL_BEFORE_TIMESTAMP and SQL_AFTER_TIMESTAMP, respectively.
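The inclusive/exclusive distinction could look roughly like the following Python sketch. The exact semantics are an assumption made for illustration, not something the task defines: here SQL_BEFORE_TIMESTAMP stops before the first transaction whose timestamp is at or after the limit, and SQL_AFTER_TIMESTAMP also applies the transactions stamped exactly at the limit. The function name and the binlog representation are hypothetical.

```python
from typing import List, Tuple

Transaction = Tuple[str, int]  # (GTID, commit timestamp in seconds)


def transactions_to_apply(binlog: List[Transaction], until_ts: int, mode: str) -> List[str]:
    """Return the GTIDs a replica would apply before stopping (assumed semantics)."""
    if mode not in ("SQL_BEFORE_TIMESTAMP", "SQL_AFTER_TIMESTAMP"):
        raise ValueError(f"unknown mode: {mode}")
    applied = []
    for gtid, ts in binlog:
        # Exclusive: stop before the first transaction at or past the limit.
        if mode == "SQL_BEFORE_TIMESTAMP" and ts >= until_ts:
            break
        # Inclusive: also apply transactions stamped exactly at the limit.
        if mode == "SQL_AFTER_TIMESTAMP" and ts > until_ts:
            break
        applied.append(gtid)
    return applied


binlog = [("0-1-10", 100), ("0-1-11", 105), ("0-1-12", 105), ("0-1-13", 110)]
before = transactions_to_apply(binlog, 105, "SQL_BEFORE_TIMESTAMP")
after = transactions_to_apply(binlog, 105, "SQL_AFTER_TIMESTAMP")
print(before, after)
```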
Skills needed: C++, Lex/Yacc Mentors: Brandon Nesterenko
MDEV-38721 one-pass HNSW search
Full-time project 350h
The idea here is to treat the HNSW graph as a flat, one-level, non-hierarchical graph and search all layers at once, without actually flattening it. Let's call it VF-HNSW, Virtually Flattened Hierarchical Navigable Small World.
This project is to implement this algorithm and benchmark it using ann-benchmarks.
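One possible reading of "virtually flattened" is a best-first search in which the neighbours of a node are the union of its edges on every layer, computed on the fly. The Python sketch below illustrates that reading only; it is not the algorithm specified in the task, and the data layout (`layers` as a list of adjacency dicts) and the names `virtual_neighbours` and `search` are assumptions.

```python
import heapq
import math
from typing import Dict, List, Sequence, Set

Layer = Dict[int, Set[int]]  # node id -> neighbours on that layer


def virtual_neighbours(layers: List[Layer], node: int) -> Set[int]:
    """Union of the node's edges across all layers; nothing is materialized."""
    result: Set[int] = set()
    for layer in layers:
        result |= layer.get(node, set())
    return result


def search(vectors: Dict[int, Sequence[float]], layers: List[Layer],
           entry: int, query: Sequence[float], ef: int) -> List[int]:
    """Best-first search keeping the ef closest nodes, nearest first."""
    def dist(n: int) -> float:
        return math.dist(vectors[n], query)

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap of nodes to expand
    best = [(-dist(entry), entry)]        # max-heap (negated) of results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # the closest candidate is worse than the worst result
        for nb in virtual_neighbours(layers, node):
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return [n for _, n in sorted((-d, n) for d, n in best)]


vectors = {i: (float(i),) for i in range(5)}
layer0 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}  # dense chain
layer1 = {0: {4}, 4: {0}}                                    # long link
nearest = search(vectors, [layer0, layer1], entry=0, query=(3.9,), ef=2)
print(nearest)
```

In this toy graph the long link from the upper layer lets the search jump from node 0 straight to node 4 without a separate descent phase.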
Skills needed: C++, Python Mentors: Sergei Golubchik
MDEV-33411 OPTIMIZE TABLE for graph indexes
Full-time project 350h
Implement bulk insert into the mHNSW index. Various optimizations become possible when all vectors are available in advance.
Skills needed: C++ Mentors: Sergei Golubchik
MDEV-31342 I_S optimization: avoid temp table
Part-time project 175h or can be combined with MDEV-31535 for a Full-time project 350h
Currently information_schema tables work like this:
1. prepare the information_schema table
   this creates a temporary table
2. call the information_schema implementation code
   it sets values using Field::store() and calls schema_table_store_record() per row
   schema_table_store_record() uses handler::ha_write_row() to store the row in the temporary table
3. when the temporary table is filled with data, it's used in the query.
For queries like SELECT f1, f2, ... FROM INFORMATION_SCHEMA.tbl the above adds a lot of overhead. The server can recognize that case, skip creating the temporary table in step 1, and modify schema_table_store_record() to send results directly to the client.
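The proposed shortcut can be pictured as swapping the destination of each stored record. The following Python sketch is a simplified model, not server code: `fill_table` stands in for an information_schema implementation, and `run_query` and its `simple_select` flag are hypothetical names for the decision the server would make.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, ...]


def fill_table(store_record: Callable[[Row], None]) -> None:
    """Stand-in for an information_schema implementation: one call per row."""
    for name in ("information_schema", "mysql", "test"):
        store_record((name,))


def run_query(simple_select: bool, send_to_client: Callable[[Row], None]) -> int:
    """Return how many rows had to be buffered in a temporary table."""
    if simple_select:
        # Fast path: rows go straight to the client, nothing is buffered.
        fill_table(send_to_client)
        return 0
    # General path: fill the temporary table first, then read from it.
    temp_table: List[Row] = []
    fill_table(temp_table.append)
    for row in temp_table:
        send_to_client(row)
    return len(temp_table)


direct: List[Row] = []
buffered_fast = run_query(True, direct.append)
via_temp: List[Row] = []
buffered_slow = run_query(False, via_temp.append)
print(buffered_fast, buffered_slow, direct == via_temp)
```

Both paths deliver the same rows to the client; only the intermediate buffering differs.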
Skills needed: C++ Mentors: Oleksandr Byelkin
MDEV-31535 optimize directory listing for information_schema tables based on privileges
Part-time project 175h or can be combined with MDEV-31342 for a Full-time project 350h
Usually when INFORMATION_SCHEMA.TABLES (or any other table that is implemented via get_all_tables() function) is queried, it creates a list of all schemas first, then for every schema it creates a list of all files in that schema.
In certain cases the above is optimized:
when a specific table is requested via TABLE_SCHEMA=xxx AND TABLE_NAME=yyy in the WHERE clause: in this case only that one table is opened
when a specific schema is requested via TABLE_SCHEMA=xxx: tables for only that schema are listed, the list of all schemas is not created
when privileges only allow access to certain schemas: the list of all schemas is created, but tables are listed only for those schemas that pass the privilege check
Note that in the last case the server still creates a list of all schemas. This can be expensive if there are thousands of them and the privileges only allow access to one specific schema. It makes sense to treat this case as if the schema name was explicitly specified in the WHERE clause. Almost, because the user will also have access to the INFORMATION_SCHEMA itself, but it's already treated specially anyway.
That is:
if the user does not have global grants that allow seeing all schemas, then
   for every schema-level (and table-level?) grant:
      if the schema name is not a pattern (does not contain wildcards), directly append this schema name to the list, if the schema exists
   append "INFORMATION_SCHEMA"
if the above isn't true, fall back to the directory listing.
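The steps above can be sketched in Python as follows. This is one reading of the outline, with assumptions: any grant on a pattern triggers the fallback, escaped wildcards are not handled, and the function name `schemas_to_list` and its parameters are hypothetical. A return value of None means "fall back to the directory listing".

```python
from typing import Iterable, List, Optional, Set

WILDCARDS = ("%", "_")  # assumption: escaping of wildcards is ignored here


def schemas_to_list(has_global_grant: bool,
                    granted_schemas: Iterable[str],
                    existing_schemas: Set[str]) -> Optional[List[str]]:
    """Build the schema list from grants, or return None to signal fallback."""
    if has_global_grant:
        return None  # the user may see everything: list the directory
    result: List[str] = []
    for name in granted_schemas:
        if any(ch in name for ch in WILDCARDS):
            return None  # pattern grant: cannot enumerate without listing
        if name in existing_schemas and name not in result:
            result.append(name)
    result.append("INFORMATION_SCHEMA")
    return result


existing = {"shop", "test"}
print(schemas_to_list(False, ["shop", "gone"], existing))
print(schemas_to_list(False, ["app%"], existing))
```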
Skills needed: C++ Mentors: Oleksandr Byelkin
MDEV-38329 Named parameters in invocation of stored routines
Full-time project 350h
Add support for syntax like:
CALL proc(param3 => 10, param5 => "foo");
SELECT func(param2 => 3.1415);
Parameters that are not explicitly mentioned get their default values.
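The binding rule can be illustrated with a short Python sketch that turns named arguments into a full positional argument list. The names `bind_named` and `REQUIRED` are hypothetical, and the error handling for unknown names and for parameters without a default is an assumption, since the task description only states that unmentioned parameters get their default values.

```python
from typing import Any, Dict, List, Tuple

REQUIRED = object()  # marker for a parameter declared without a default


def bind_named(params: List[Tuple[str, Any]], named: Dict[str, Any]) -> List[Any]:
    """Map named arguments onto the declared parameter order."""
    declared = {name for name, _ in params}
    unknown = set(named) - declared
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    values = []
    for name, default in params:
        if name in named:
            values.append(named[name])
        elif default is REQUIRED:
            raise ValueError(f"no value for parameter {name}")
        else:
            values.append(default)  # not mentioned: use the default
    return values


# Declaration order and defaults of a hypothetical routine.
params = [("param1", 1), ("param2", 2.0), ("param3", 0)]
bound = bind_named(params, {"param3": 10})
print(bound)
```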
Skills needed: C++ Mentors: Alexander Barkov
MDEV-12320 configurable default authentication plugin for the server
Full-time project 350h
Make the server's default authentication plugin configurable.
Here "default" refers to the plugin name that the server uses for the first handshake packet, that is, the plugin the server uses when no username is yet known.
Skills needed: C, C++ Mentors: Nikita Malyavin
MDEV-13594 Support for JSON operators column->path and column->>path
Part-time project 175h or can be combined with MDEV-38591 for a Full-time project 350h
Implement this syntactic sugar in MariaDB for MySQL compatibility.
Skills needed: C++ Mentors: Rucha Deodhar
MDEV-38591 MEMBER OF json operator
Part-time project 175h or can be combined with MDEV-13594 for a Full-time project 350h
Implement the MEMBER OF operator for MySQL compatibility.
Skills needed: C++ Mentors: Rucha Deodhar
MDEV-37591 Binlog Table Map Event to be a Sequential Index
Part-time project 175h
The mapping defined by a binary log Table_map_log_event can be revamped to improve slave efficiency. Currently, when a transaction is binlogged using binlog_format=ROW, a Table_map event is written in the binary log to declare a table that the transaction is updating, and includes information to identify this table on the slave. In particular, this information includes an identification number (table_id) that is used by a Rows log event in this transaction, which specifies that the given row event is targeting that given table. This table_id is only applicable for the server which actually logged the event, and is meaningless to the slave for execution, outside of its use to identify the table to target. However, when the slave uses this table_id to identify a table, it does so by searching/iterating through a list of all tables targeted by the transaction.
This search for a table can be optimized by changing the assignment strategy of the table_id to effectively work as an index into the list of tables targeted by the transaction. That is, instead of using the actual table_id of the given table on the master server, the value can be filled in using some 0-indexed counter. Then when the slave needs to find the table that a given row event is targeting, it would use this index to simply access the table directly (rather than iteratively search).
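The difference between the two lookup strategies can be shown with a small Python model. It is only a sketch: the table names and ids are made up, and `find_table_by_search` and `find_table_by_index` are hypothetical helpers rather than server functions.

```python
from typing import List, Tuple


def find_table_by_search(table_maps: List[Tuple[int, str]], table_id: int) -> str:
    """Current behaviour: scan the transaction's table list for the table_id."""
    for mapped_id, name in table_maps:
        if mapped_id == table_id:
            return name
    raise KeyError(table_id)


def find_table_by_index(tables: List[str], index: int) -> str:
    """Proposed behaviour: the logged value is a 0-based position in the list."""
    return tables[index]


# Table_map entries as logged today: (table_id on the master, table name).
table_maps = [(1042, "db.orders"), (77, "db.items"), (3051, "db.audit")]
# With a sequential counter, the position in the list is the identifier.
tables = [name for _, name in table_maps]

print(find_table_by_search(table_maps, 3051), find_table_by_index(tables, 2))
```

The search cost grows with the number of tables in the transaction, while the indexed access stays constant.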
Skills needed: C++ Mentors: Brandon Nesterenko
Suggest a Task
Do you have an idea of your own, not listed above? Do let us know in the comments below (Click 'Login' on the top of the page first)!
This page is licensed: CC BY-SA / Gnu FDL

