I have an existing application that uses MongoDB to store metadata about files to make them searchable. Since searching those records is not very time-sensitive for me, I thought about migrating the whole database to S3/Athena. The advantage: there would no longer be a database to maintain, since Athena runs serverless on AWS.
I started with a JSON export of the MongoDB data to have a look at the document structure.
Unfortunately, this only worked until I discovered the different formats in which MongoDB stores my filesize information.
So I decided to export my data through the following PHP script to unify the different number formats into one.
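The core of that normalization can be sketched as follows (in Python rather than PHP, and with a hypothetical `filesize` field name). MongoDB's Extended JSON wraps 64-bit integers as `{"$numberLong": "…"}` while smaller values are exported as plain numbers, so the idea is to collapse both shapes into a plain integer:

```python
def normalize_number(value):
    """Collapse MongoDB Extended JSON number wrappers to a plain int."""
    if isinstance(value, dict) and "$numberLong" in value:
        return int(value["$numberLong"])
    if isinstance(value, dict) and "$numberInt" in value:
        return int(value["$numberInt"])
    # Already a plain number: leave it untouched.
    return value

def normalize_document(doc):
    """Return a copy of a document with a unified filesize format.

    'filesize' is a hypothetical field name standing in for the
    actual key in the export.
    """
    doc = dict(doc)
    if "filesize" in doc:
        doc["filesize"] = normalize_number(doc["filesize"])
    return doc
```

Applied to every exported line, this yields one consistent number format regardless of how MongoDB serialized the value.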
So here is my document structure to begin with.
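For illustration, a document with the mixed number formats might look like this (field names and values are hypothetical, not the actual export):

```json
{
  "_id": { "$oid": "507f1f77bcf86cd799439011" },
  "filename": "report.pdf",
  "filesize": { "$numberLong": "5242880" }
}
```

Note the two problem spots for Athena: the `$numberLong` wrapper around the size, and key names such as `_id` and `$oid` that contain special characters.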
I uploaded the whole export into an S3 bucket and then started to look into Athena.
As I found out, Athena (and thereby Presto) has some issues with this source format: it cannot access any element whose name contains a special character like '$' or '_'. For those characters I needed to create a mapping so the SerDe could handle those fields.
Here is the table definition I came up with:
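A sketch of what such a definition can look like, using the OpenX JSON SerDe; the table name, columns, and S3 path are assumptions. The `mapping.*` SerDe properties rename JSON keys containing characters like '$' or a leading '_' to legal column names:

```sql
CREATE EXTERNAL TABLE file_metadata (
  id       struct<oid:string>,
  filename string,
  filesize bigint
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  -- map the problematic JSON keys to clean column names
  "mapping.id"  = "_id",
  "mapping.oid" = "$oid"
)
LOCATION 's3://my-bucket/mongo-export/';
```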
In addition to searching column-wise, MongoDB also supports searching the complete document with a RegEx. Since Athena only works in columns and rows, I created a second table which contains only the raw JSON string.
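Such a raw table can be sketched as a single-column table over the same data (names assumed). One way to get each whole line into one column is a delimited format whose field separator never occurs in the export:

```sql
CREATE EXTERNAL TABLE file_metadata_raw (
  json string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'  -- assumes the export contains no tab characters
LOCATION 's3://my-bucket/mongo-export/';
```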
Now, whenever I needed the RegEx capability, I could join in the raw table and run the RegEx on this column. Since Athena separates the read logic from the actual storage, both tables can point at the same S3 location, so I also did not need to duplicate my data pool for this.
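A combined query might then look like this (the join key and the pattern are assumptions; `regexp_like` and `json_extract_scalar` are Presto built-ins available in Athena):

```sql
SELECT m.filename, m.filesize
FROM file_metadata m
JOIN file_metadata_raw r
  ON json_extract_scalar(r.json, '$.filename') = m.filename
WHERE regexp_like(r.json, 'invoice.*2019');
```

The columnar table keeps the cheap, typed lookups, while the raw table provides the full-document RegEx that MongoDB offered.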