In today’s post I’m going to explore a fun data analysis tool for beginners – DataBasic.io. I had a great time trying it out – you should have a go too!
DataBasic performs simple but insightful operations on data. No technical expertise (beyond being able to navigate a webpage) is required. DataBasic comprises three tools to help you understand text and tabular data (tables): WordCounter, SameDiff and WTFcsv.
WordCounter gives you the most frequently used words, bigrams (two consecutive words) and trigrams (three consecutive words) in a text excerpt. The site provides sample texts for you to look at, e.g. Clinton and Trump’s speeches or Bob Dylan’s lyrics.
You can input your own data by uploading a file, pasting in text or pasting a link, so you don’t have to process your data much beforehand, which makes the tool easier to use.
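If you’re curious what this kind of counting looks like under the hood, here is a minimal Python sketch. It is my own illustration, not DataBasic’s actual code: the function names (`tokenize`, `ngrams`, `top_ngrams`) and the sample sentence are made up, and the real tool may well treat punctuation, capitalisation and common words differently.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes.
    # This is a simplification; real tokenizers are more careful.
    return re.findall(r"[a-z']+", text.lower())

def ngrams(words, n):
    # Slide a window of n consecutive words across the list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def top_ngrams(text, n, k=3):
    # Count every n-gram and return the k most frequent ones.
    return Counter(ngrams(tokenize(text), n)).most_common(k)

# Hypothetical sample text, invented for this example.
sample = "you know what I mean, you know what I said, and you know it"

print(top_ngrams(sample, 1))  # single words
print(top_ngrams(sample, 2))  # bigrams
print(top_ngrams(sample, 3))  # trigrams
```

Setting `n` to 1, 2 or 3 gives you words, bigrams and trigrams respectively, which is the same three-way breakdown WordCounter shows.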
- Feature suggestion: It would be nice if you could paste multiple links to collate text from different articles.
I like the ‘What do I do next?’ sections on the results pages, which suggest next actions you can take to tell stories with your data. For example, knowing that ‘know’ is a popular word is interesting, but it’s even more insightful when combined with how the word ‘know’ is used in the most frequent bigrams and trigrams (e.g. ‘you know’, ‘you know what’).
There are also tooltips next to technical terms like ‘bigram’ to explain what they mean.
A follow-up question to looking at Clinton and Trump’s speeches would be ‘what are the similarities and differences between the two?’ SameDiff does just that: it takes two sets of texts and compares them to each other, giving you a similarity score based on how frequently words are used in each (cosine similarity score). It also gives you words that are used frequently only in Clinton’s speeches or Trump’s speeches and words that are used frequently in both.
When you mouse over each word, you get the number of times the word occurs as a percentage of the total number of words in the text.
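For the curious, a cosine similarity score over word counts can be sketched in a few lines of Python. This is only an illustration of the general technique, not SameDiff’s actual implementation: the helper names and the two example sentences are invented, and the real tool may preprocess text differently.

```python
import math
import re
from collections import Counter

def word_counts(text):
    # Count how often each lowercase word appears in the text.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    # Treat each text as a vector of word counts and compute the
    # cosine of the angle between them: 1.0 means the same word
    # proportions, 0.0 means no words in common.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def share(counts, word):
    # A word's occurrences as a percentage of all words in the text.
    total = sum(counts.values())
    return 100 * counts[word] / total if total else 0.0

# Hypothetical texts, invented for this example.
a = word_counts("we will win and we will build")
b = word_counts("we will work and we will win")

print(round(cosine_similarity(a, b), 3))
print("only in a:", sorted(a.keys() - b.keys()))
print("only in b:", sorted(b.keys() - a.keys()))
print("in both:", sorted(a.keys() & b.keys()))
print(round(share(a, "we"), 1), "% of text a is 'we'")
```

The three sets at the end mirror SameDiff’s three columns (words in one text only, the other only, or both), and `share` is the percentage you see on mouse-over.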
WTFcsv gives you the frequencies of the values in each attribute (column) of a table, presenting them as a histogram, bar chart or word cloud, depending on which visualisation is most appropriate. For example, we can see from the chart above that the ratio between people who survived and people who didn’t is about 3.5:5.5, or roughly 2:3.
PS: It’s called WTFcsv because tabular data is often stored in CSV (comma-separated values) files.
WTFcsv is thus a quick way to get an overview of your data. You need only upload the file.
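If you want to see roughly what a per-column frequency count involves, here is a small Python sketch using only the standard library. It is my own illustration rather than how WTFcsv works internally: the rows are made up (they are not the Titanic sample data), and the `Survived` column name and the `column_frequencies` helper are assumptions for the example.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; not the real Titanic data.
SAMPLE_CSV = """PassengerId,Survived,Parch
1,0,0
2,1,0
3,1,2
4,0,0
5,0,1
"""

def column_frequencies(csv_text):
    # Read the table and count how often each value appears
    # in each column. Values are kept as strings.
    reader = csv.DictReader(io.StringIO(csv_text))
    freqs = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            freqs[name][value] += 1
    return freqs

freqs = column_frequencies(SAMPLE_CSV)
for name, counter in freqs.items():
    # Show the three most common values per column.
    print(name, counter.most_common(3))
```

A column like `PassengerId`, where every value appears once, tells you little, while a column like `Parch` immediately shows which value dominates. That is the kind of pattern the charts make visible at a glance.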
As with WordCounter, there is also a ‘What do I do next?’ section that prompts you to ask questions like ‘Can I compare the PassengerId column to the SibSp (number of siblings and spouses people have on board the Titanic) column?’ or ‘Is it surprising that 0 is the most frequent value in the Parch (number of parents or children people have on board the Titanic) column?’. I don’t know whether these prompts are available only for the sample data, but it’s a great feature and a great start nonetheless.
Results from WordCounter and SameDiff can be downloaded as CSV files, so it’s easy to fit DataBasic into your workflow. Props to the DataBasic team for making this set of friendly but effective tools. Be sure to try them out!