The Matrix protocol has a nice 'bridge' concept for integrating with other chat systems, which gives me confidence that the Matrix format would be a suitable one for the messages. I imagine writing programs that synchronise messages from their respective chat servers and map them into the Matrix format before storing them on the filesystem.
I'd term this the chat retrieval agent (CRA).
Chat system -> CRA -> matrixdir
Each CRA would be responsible for connecting to the chat system in question, downloading the messages for the user, and translating them into matrix format. The messages would then be stored unencrypted on the filesystem.
I'd term the program that shows the stored messages to the user the chat user agent (CUA).
matrixdir -> CUA
The CUA here is a viewer of messages for the user; it just reads data off disk and builds its UI from that. It should also be able to watch the filesystem for new messages so that it doesn't have to reload everything.
I'd term the program that sends messages the chat submission agent (CSA).
CUA -> CSA -> Chat system
This is where I'm currently a bit fuzzy. Keeping to how it works in email, and keeping login logic out of the CUA, I want the CUA to invoke a process, write to a socket, or something similar to send a message, with the matrix format as input. The CSA then works out whether to encrypt it, and sends it to the server. What I'm unsure about is whether the message needs to be copied to the matrixdir at this point, or whether the CRA should pick it up during a sync (or, if we do both, how we avoid duplicates).
chats/
  lock.pid
  room1/
    <timestamp_millis>-events.jsonl
    attachments/
      <server-name>/
        <media-id>
  room2/
    <timestamp_millis>-events.jsonl
My proposal is that, during synchronisation, events are split by the 'room' that they are for (what chat the message is in), and there is a directory for the events in each room. Within the directory for each room, events are split across files according to the CRA's chunking choice, with each file holding a time slice of the room's events. A file's name gives the timestamp (in milliseconds) of the first event in that file, and the file should not contain events from before this timestamp. New events should be appended to the file with the latest timestamp. The CRA is free to start new files based on, for example, each day, each week, or a number of messages. The CRA only appends new events to files, never modifying events that have already been written. The format is JSON lines because matrix messages can be arbitrary and have a convenient, self-describing JSON representation, which also works well with tools like jq. When initialising, the CRA must acquire the file lock on the lock.pid file in the root of the chats directory, and write its own process ID into the file.
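The CRA side of this can be sketched roughly as follows. This is only an illustration, not the matrixdir implementation: the function names are made up, the lock uses Unix `flock`, and it assumes the event's `origin_server_ts` field is the millisecond timestamp used to name a new file.

```python
import fcntl
import json
import os
from pathlib import Path


def acquire_lock(root: Path):
    """Take an exclusive lock on <root>/lock.pid and write our PID into it."""
    root.mkdir(parents=True, exist_ok=True)
    fd = os.open(root / "lock.pid", os.O_RDWR | os.O_CREAT, 0o644)
    lock_file = os.fdopen(fd, "r+")
    try:
        # LOCK_NB makes this fail straight away if another CRA holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        raise
    lock_file.truncate(0)
    lock_file.write(str(os.getpid()))
    lock_file.flush()
    return lock_file  # the lock is held for as long as this stays open


def latest_events_file(room_dir: Path):
    """Return the event file with the highest start timestamp, or None."""
    files = sorted(
        room_dir.glob("*-events.jsonl"),
        key=lambda p: int(p.name.split("-", 1)[0]),
    )
    return files[-1] if files else None


def append_event(root: Path, room: str, event: dict, start_new_chunk=False):
    """Append one event as a JSON line; existing lines are never rewritten."""
    room_dir = root / room
    room_dir.mkdir(parents=True, exist_ok=True)
    target = latest_events_file(room_dir)
    if target is None or start_new_chunk:
        # Assumption: name the new file after the first event's timestamp.
        target = room_dir / f"{event['origin_server_ts']}-events.jsonl"
    with open(target, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, separators=(",", ":")) + "\n")
    return target
```

When to pass `start_new_chunk=True` (daily, weekly, after N messages) is left to the CRA, as described above.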
Attachments are stored in an attachments directory next to the events file for each room. The location within is determined from the matrix content URI (mxc://<server-name>/<media-id>). This enables leaving the original event untouched while being able to check whether the attachment has been downloaded.
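As a rough sketch of that mapping, the following turns a content URI into a path under a room's attachments directory. The helper names are hypothetical, and rejecting path-like components is my own precaution rather than part of the proposal.

```python
from pathlib import Path


def attachment_path(room_dir: Path, mxc_uri: str) -> Path:
    """Map mxc://<server-name>/<media-id> to attachments/<server-name>/<media-id>."""
    prefix = "mxc://"
    if not mxc_uri.startswith(prefix):
        raise ValueError(f"not a content URI: {mxc_uri!r}")
    server_name, _, media_id = mxc_uri[len(prefix):].partition("/")
    # Refuse empty or path-like parts so a URI cannot point outside attachments/.
    for part in (server_name, media_id):
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"malformed content URI: {mxc_uri!r}")
    return room_dir / "attachments" / server_name / media_id


def is_downloaded(room_dir: Path, mxc_uri: str) -> bool:
    """Check for the attachment on disk without touching the stored event."""
    return attachment_path(room_dir, mxc_uri).is_file()
```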
When a client (CUA) wants to view chats it can create a notify subscription on the event files for the rooms it is interested in, and maintain seek positions in them for where it has read to. Then, when given a notification that a file has been updated, it can read from its seek position in that file, reading line-by-line to get the events.
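A minimal sketch of the reading half of this, assuming the CUA stores one byte offset per event file and calls a function like the hypothetical `read_new_events` below whenever its file watcher fires. The notification mechanism itself (inotify or similar) is left out.

```python
import json


def read_new_events(path, offset):
    """Read complete lines after byte `offset`; return (events, new_offset)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume up to the last newline. A partial final line is left for
    # the next notification, by which time the writer should have finished it.
    end = data.rfind(b"\n") + 1
    events = []
    for raw in data[:end].split(b"\n"):
        if not raw:
            continue
        try:
            events.append(json.loads(raw))
        except ValueError:
            # A line that is not valid JSON is skipped rather than fatal.
            continue
    return events, offset + end
```

The returned offset is what the CUA would store and pass back in on the next notification for that file.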
Splitting the events by room means that a client can just load events for a room as needed. Splitting events by arbitrary chunks allows the CRA to adapt to volumes within channels and avoid lots of tiny files for a room or having one large file for a room.
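To illustrate how a client might use the chunking, here is a small sketch that lists a room's event files in time order and picks the ones that can contain events from a given time onwards. The function names are invented for this example.

```python
from bisect import bisect_right
from pathlib import Path


def chunk_files(room_dir: Path):
    """Return [(start_ms, path)] for a room's event files, oldest first."""
    chunks = []
    for path in room_dir.glob("*-events.jsonl"):
        start = path.name.partition("-")[0]
        if start.isdigit():
            chunks.append((int(start), path))
    # Sort numerically: as strings, "10000" would sort before "2000".
    return sorted(chunks)


def chunks_since(room_dir: Path, since_ms: int):
    """Return the event files that may hold events at or after `since_ms`."""
    chunks = chunk_files(room_dir)
    starts = [start for start, _ in chunks]
    # The first relevant file is the last one that starts at or before since_ms.
    first = max(bisect_right(starts, since_ms) - 1, 0)
    return [path for _, path in chunks[first:]]
```

A client showing only recent history for one room would then open just the files returned by `chunks_since` for that room's directory.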
This doesn't achieve maildir's nice atomicity through using the mv operation. The lack of atomicity in the writes could lead to incomplete data in the file, for instance when the CRA crashes before fully writing an event. A simple solution to this is for the CUAs to ignore invalid lines. When the CRA starts up it should check that each event file ends in a newline, so that a corrupt entry on the last line can be skipped properly. While it is tempting to provide a cleanup operation on the files, this would invalidate the assumption that an event file is append-only. Additionally, the file lock aims to prevent concurrent processes (that are cooperating with the protocol) from corrupting a file by appending to it at the same time.
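The startup check could look something like the sketch below. Note the assumption: the text above only says the CRA should check for a trailing newline, and appending one when it is missing is my reading of how the check would be acted on. It keeps the file append-only, since the broken line is terminated rather than removed.

```python
import os


def ensure_trailing_newline(path) -> bool:
    """Terminate a possibly truncated last line; return True if a newline was added."""
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # empty file, nothing to terminate
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False  # last line is already complete
        # Append only: the partial line stays, but the next event written
        # will start on a fresh line and readers can skip the broken one.
        f.seek(0, os.SEEK_END)
        f.write(b"\n")
        return True
```

On the CUA side, nothing more is needed than skipping any line that fails to parse as JSON.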
This is not intended to be the only storage for CUAs. This format cannot be everything to every client, so they may need to implement their own caches for performance. It does, however, aim to be a solid starting point for them to rebuild caches from.
File per message. I envision this might lead to too many files and doesn't have an inherent ordering scheme compared to appending to the file. File names could be ordered but I still think it might lead to too many files, particularly given that most chat messages are small.
One big file. This prevents the clients from efficiently viewing events from a subset of rooms, such as on startup.
I'd like to make some time to implement this as a Rust library that handles the basics of the file layout, and then try to integrate synchronising from a matrix server. In fact, here's a work in progress repo: matrixdir. Once that's working I think it would be interesting to try to integrate other protocols via matrix bridges, hopefully minimising the work needed, as they should already be able to transform things to and from the matrix protocol format. Perhaps in parallel I would be able to make a start on moving chatters towards working with this new file format rather than having the synchronisation backends embedded in it.
If you have any comments on this proposal, or want to work on it in some way, let me know.