Gmail to SQLite
Gmail to SQLite
This is a script to download emails from Gmail and store them in a SQLite database for further analysis. I find it extremely useful to have all my emails in a database to run queries on them. For example, I can find out how many emails I received per sender, which emails take the most space, and which emails from which sender I never read.
Installation
- Clone this repository:
git clone https://github.com/marcboeker/gmail-to-sqlite.git
. - Install the requirements:
pip install -r requirements.txt
- Create a Google Cloud project here.
- Open Gmail in API & Services and activate the Gmail API.
- Open the OAuth consent screen and create a new consent screen. You only need to provide a name and contact data.
- Next open Create OAuth client ID and create credentials for a
Desktop app
. Download the credentials file and save it undercredentials.json
in the root of this repository.
Here is a detailed guide on how to create the credentials: https://developers.google.com/gmail/api/quickstart/python#set_up_your_environment.
Usage
Sync all emails
- Run the script:
python main.py sync --data-dir path/to/your/data
where--
is the path where all data is stored. This creates a SQLite database in
and stores the user credentials under/messages.db
./credentials.json - After the script has finished, you can query the database using, for example, the
sqlite3
command line tool:sqlite3
./messages.db - You can run the script again to sync all new messages. Provide
--full-sync
to force a full sync. However, this will only update the read status, the labels, and the last indexed timestamp for existing messages.
Sync a single message
python main.py sync-message --data-dir path/to/your/data --message-id
Commandline parameters
usage: main.py [-h] [--data-dir DATA_DIR] [--update] {sync, sync-message}
Main commands:
sync Sync emails from Gmail to the database.
sync-message Sync a single message from Gmail to the database.
--data-dir DATA_DIR Path to the directory where all data is stored.
--full-sync Force a full sync.
--message-id MESSAGE_ID Sync only the message with the given message id.
Schema
CREATE TABLE IF NOT EXISTS "messages" ( "id" INTEGER NOT NULL PRIMARY KEY, -- internal id "message_id" TEXT NOT NULL, -- Gmail message id "thread_id" TEXT NOT NULL, -- Gmail thread id "sender" JSON NOT NULL, -- Sender as JSON in the form {"name": "Foo Bar", "email": "[email protected]"} "recipients" JSON NOT NULL, -- JSON object: { -- "to": [{"email": "[email protected]", "name": "Foo Bar"}, ...], -- "cc": [{"email": "[email protected]", "name": "Foo Bar"}, ...], -- "bcc": [{"email": "[email protected]", "name": "Foo Bar"}, ...] --} "labels" JSON NOT NULL, -- JSON array: ["INBOX", "UNREAD", ...] "subject" TEXT NOT NULL, -- Subject of the email "body" TEXT NOT NULL, -- Extracted body either als HTML or plain text "size" INTEGER NOT NULL, -- Size reported by Gmail "timestamp" DATETIME NOT NULL, -- When the email was sent/received "is_read" INTEGER NOT NULL, -- 0=Unread, 1=Read "is_outgoing" INTEGER NOT NULL, -- 0=Incoming, 1=Outgoing "last_indexed" DATETIME NOT NULL -- Timestamp when the email was last seen on the server );
Example queries
Get the number of emails per sender
SELECT sender->>'$.email', COUNT(*) AS count FROM messages GROUP BY sender->>'$.email' ORDER BY count DESC
Show the number of unread emails by sender
This is great to determine who is spamming you the most with uninteresting emails.
SELECT sender->>'$.email', COUNT(*) AS count FROM messages WHERE is_read = 0 GROUP BY sender->>'$.email' ORDER BY count DESC
Get the number of emails for a specific period
- For years:
strftime('%Y', timestamp)
- For months in a year:
strftime('%m', timestamp)
- For days in a month:
strftime('%d', timestamp)
- For weekdays:
strftime('%w', timestamp)
- For hours in a day:
strftime('%H', timestamp)
SELECT strftime('%Y', timestamp) AS period, COUNT(*) AS count FROM messages GROUP BY period ORDER BY count DESC
Find all newsletters and group them by sender
This is an amateurish way to find all newsletters and group them by sender. It's not perfect, but it's a start. You could also use
SELECT sender->>'$.email', COUNT(*) AS count FROM messages WHERE body LIKE '%newsletter%' OR body LIKE '%unsubscribe%' GROUP BY sender->>'$.email' ORDER BY count DESC
Show who has sent the largest emails in MB
SELECT sender->>'$.email', sum(size)/1024/1024 AS size FROM messages GROUP BY sender->>'$.email' ORDER BY size DESC
Count the number of emails that I have sent to myself
SELECT count(*) FROM messages WHERE EXISTS ( SELECT 1 FROM json_each(messages.recipients->'$.to') WHERE json_extract(value, '$.email') = '[email protected]' ) AND sender->>'$.email' = '[email protected]'
List the senders who have sent me the largest total volume of emails in megabytes
SELECT sender->>'$.email', sum(size)/1024/1024 as total_size FROM messages WHERE is_outgoing=false GROUP BY sender->>'$.email' ORDER BY total_size DESC
Roadmap
What's Your Reaction?






