References

An amalgamation of code, commands, formulas, and tidbits whose syntax I find myself repeatedly googling.

ML code

The most basic call to the OpenAI client

import os

import openai

base_url = None  # if you need to use a non-OpenAI model, set this to that provider's base_url
api_key = os.environ['OPENAI_API_KEY']  # The client reads this env var by default; kept here in case you need some other API key.
client = openai.OpenAI(base_url=base_url, api_key=api_key)

prompt = "your prompt here!"
model = "gpt-4o-mini"
max_tokens = 128
response = client.chat.completions.create(
    model=model,
    messages=[{'role': 'user', 'content': prompt}],
    max_tokens=max_tokens,
)
print(response.choices[0].message.content)

Call the OpenAI client with threads

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI

# One client shared across all worker threads
shared_client = OpenAI(base_url=None, api_key=os.environ['OPENAI_API_KEY'])

def generate(content: str):
    return shared_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{'role': 'user', 'content': content}],
        max_tokens=512,
    ).choices[0].message.content

prompts = [
    "I am a beacon of knowledge blazing out across a black sea of ignorance.", 
    "I'm coming friends, wait for me!", 
    "Axe is a pig?", 
    "Lich gonna have your mana!", 
]

results = {}


# Change max_workers here
with ThreadPoolExecutor(max_workers=12) as executor:
    futures = {executor.submit(generate, prompt): index for index, prompt in enumerate(prompts)}
    for future in as_completed(futures):
        i = futures[future]
        result = future.result()
        results[i] = result
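
as_completed yields futures in the order they finish, not the order they were submitted, which is why results is keyed by the original index. A minimal sketch of rebuilding the outputs in prompt order follows; fake_generate is a hypothetical stand-in for generate so the snippet runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_generate(content: str) -> str:
    # Hypothetical stand-in for generate(); swap in the real API call.
    return content.upper()


prompts = ["first", "second", "third"]
results = {}

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the index of the prompt that produced it.
    futures = {executor.submit(fake_generate, p): i for i, p in enumerate(prompts)}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

# Rebuild a list in the same order as prompts.
ordered = [results[i] for i in range(len(prompts))]
print(ordered)
```

If you don't need to handle results as they finish, executor.map(generate, prompts) returns them in input order and skips the bookkeeping.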

Using a tiktoken tokenizer

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

tokens = enc.encode("The profundities are mine to ransack!")
not_tokens = enc.decode([1, 2, 3, 4])

Using a HuggingFace Tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer("My totality eclipses the cosm!")['input_ids']
not_tokens = tokenizer.decode([1, 2, 3, 4])

Load a HF dataset

from datasets import load_dataset
dataset = load_dataset("Aiden07/dota2_instruct_prompt", split="train")  # split picks train/test; name is for dataset configs

Python

Dict <–> JSON

import json
with open('data.json', 'r') as fp:
    data = json.load(fp)

with open('data.json', 'w') as fp:
    json.dump(data, fp, sort_keys=True, indent=4)

Read a .jsonl

import json

file_path = 'data.jsonl'
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        data = json.loads(line)
        print(data)

Write a .jsonl

import json

with open('output.jsonl', 'w') as outfile:
    for record in list_of_dicts:  # avoid shadowing the built-in dict
        json.dump(record, outfile)
        outfile.write('\n')

Pretty print a dictionary/JSON

print(json.dumps(ur_dict, indent=4))

Page through a compressed JSON (credit to levy5674!)

import gzip
import json

def stream_objects(filename):
    with gzip.open(filename, 'rt') as f:
        for line in f:
            yield json.loads(line)

json_path = "json_path.ndjson.gz"
for i, line in enumerate(stream_objects(json_path)):
    print(line.keys())
    break

List comprehension with conditionals (why does the syntax change between filtering and if/else? This annoys me)

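new_list = [x for x in old_list]
new_list = [x for x in old_list if x < 3]  # a trailing if filters items out
new_list = [x if x < 3 else y for x in old_list]  # if/else goes before the for; y is a placeholder replacement value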
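
The reason for the two placements: a trailing if is part of the comprehension syntax and filters items out, while x if cond else y before the for is an ordinary conditional expression that is applied to every item. A small sketch with made-up values (nums and the replacement value 0 are just for illustration):

```python
nums = [1, 2, 3, 4, 5]

# Trailing `if` is a filter: items failing the test are dropped.
filtered = [x for x in nums if x < 3]

# Leading `if ... else ...` is a conditional expression:
# every item is kept, and the ones failing the test are replaced.
replaced = [x if x < 3 else 0 for x in nums]

# The two can be combined: filter first, then transform what is left.
both = [x if x < 3 else 0 for x in nums if x != 4]

print(filtered)  # [1, 2]
print(replaced)  # [1, 2, 0, 0, 0]
print(both)      # [1, 2, 0, 0]
```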

Debugging Interpreter

import IPython; IPython.embed()

Time something in seconds

import time
start = time.time()
function_to_time()
duration = time.time() - start
print(duration)

Or with a decorator! (credit to Suresh Kumar!)

from functools import wraps
import time


def timeit(func):
    @wraps(func)
    def timeit_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f'Function {func.__name__}{args} {kwargs} Took {total_time:.4f} seconds')
        return result
    return timeit_wrapper


@timeit
def calculate_something(num):
    """
    Simple function that returns sum of all numbers up to the square of num.
    """
    total = sum((x for x in range(0, num**2)))
    return total

if __name__ == '__main__':
    calculate_something(10)
    calculate_something(100)
    calculate_something(1000)
    calculate_something(5000)
    calculate_something(10000)

Jupyter Magic Commands

%matplotlib inline

%load_ext autoreload
%autoreload 2

Startup Script (#!/bin/bash not needed if it’s not a script)

#!/bin/bash
jupyter lab \
    --port=8888 \
    --allow-root \
    --NotebookApp.token='' \
    --NotebookApp.password=''

Add conda env to jupyter

python -m ipykernel install --user --name <ur_env_name_here> --display-name "<ur display name here>"

Bash

Find a file with wildcards

find . -name "*.csv"

Unzip a file

unzip file.zip -d destination_folder

Conda

Create new env

conda create --name <ur_env_name_here> python=3.8

Delete old env

conda env remove --name <ur_env_name_here> 

Git commands

Remove large file from commit history - careful with this one:

git filter-branch -f --index-filter 'git rm --cached --ignore-unmatch <filepath>' HEAD

Checkout branch from remote

git checkout -b <branch_name> origin/<branch_name>

See staged changes

git diff --cached

See changes from latest commit

git diff HEAD~ HEAD

See changes from last X commits

git diff HEAD~X HEAD

Copy files to and from remote

From your local machine:

scp /path/to/local/file user@example.com:/home/name/dir

scp user@example.com:/home/name/dir/file /path/to/local/dir

You can replace user@example.com with predefined aliases in .ssh/config

F1 vs Precision vs Recall vs Accuracy

Precision - percentage of positive predictions that were correct - True Positives / (False Positives + True Positives)

Recall - percentage of positive class that was correctly identified - True Positives / (False Negatives + True Positives)

Accuracy - percentage of predictions that were correct - (True Positives + True Negatives) / (False Positives + False Negatives + True Positives + True Negatives)

F1 - 2 * (Precision * Recall) / (Precision + Recall) - harmonic mean of precision and recall
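
As a quick sanity check, the four definitions can be sketched directly from confusion-matrix counts. The function name classification_metrics and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, accuracy, and F1 from raw counts."""
    total = tp + fp + fn + tn
    # Assumption: fall back to 0.0 whenever a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Made-up counts: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 8/10, recall is 8/12, accuracy is 14/20, and F1 works out to 16/22.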

i.e. vs e.g.

I.e. stands for id est, or ‘that is’, and is used to clarify the statement before it. E.g. stands for exempli gratia, or ‘for example’, and introduces one or more examples.