Previously, we learned composite data types and basic control flow. This week, we will learn how to get data, convert data, and save data. Before we step into scraping from HTML, we will introduce a more convenient method: the API. From this week on, you are gradually entering the door of "big data": an API allows you to get more types of data and a larger volume of data at the same time. After this week, you will be able to use APIs to request data from open data websites and social media platforms, and even create an automatic writing robot on Twitter.
- Understand APIs and JSON, and be able to retrieve data from online databases (Twitter, GitHub, Weibo, Douban, ...).
- Understand basic file formats like JSON and CSV.
- Be able to comfortably navigate through compound structures like {} and [].
- Be able to comfortably use (multiple layers of) for-loops to re-format data.
- Be able to use serialisers to handle input/output to files.
A brief introduction to the Application Programming Interface (API):
An API operates in client-and-server mode. The client does not have to download the full volume of data from the server; it only requests the data it needs. The server can handle intensive computations that are not feasible on the client, and it can send updated data upon request.
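To make the "data on demand" idea concrete, the sketch below builds the kind of URL a client might send to ask a server for only a few records. The host `api.example.com`, the `/search` path, and the `q` and `per_page` parameters are hypothetical placeholders rather than a real service, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only.
base_url = "https://api.example.com/search"
params = {"q": "python", "per_page": 5}  # ask for 5 matching records, not the whole database

# urlencode() turns the dict into a query string
request_url = base_url + "?" + urlencode(params)
print(request_url)  # https://api.example.com/search?q=python&per_page=5
```

A real API documents its own paths and parameter names, so always check the documentation of the service you are using.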
Step 1: Create the file
f = open("test.txt","w")
- You can write different kinds of files by changing the file name, for example name.txt or name.json. Different file types are written and read in different ways.
- 'w' means write.
Step 2: Write content
f.write('python tutorial')
f.close()
- f.close() closes the file. This releases the file object for test.txt and makes sure the content is saved to disk.
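The two steps above can be tried end to end with the sketch below. It also shows that opening an existing file with "w" again empties it first. The temporary folder is an assumption made for the demo so that none of your own files are touched.

```python
import os
import tempfile

# A temporary folder keeps this sketch away from your own files (demo assumption).
path = os.path.join(tempfile.mkdtemp(), "test.txt")

f = open(path, "w")
count = f.write("python tutorial")  # write() returns the number of characters written
f.close()

f = open(path, "w")                 # opening with "w" again empties the existing file
f.write("second version")
f.close()

f = open(path, "r")
content = f.read()
f.close()

print(count)    # 15
print(content)  # second version
```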
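To read a file, you also need to open it first and then read the content.
Syntax:
f = open("test.txt", "r+")
contents = f.read()   # read the whole file
f.seek(0)             # go back to the beginning before reading again
content = f.read(6)   # read the first 6 characters
f.close()
- 'r' means read only. 'r+' opens the file for both reading and writing, starting from the beginning; 'r' is enough if you only want to read. To read only part of the content, pass the number of characters to read(), for example f.read(6).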
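As a rough illustration of how the read position moves through a file, the sketch below writes a small two-line file and then reads it in three pieces. The file content and the temporary folder are made up for the demo.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w") as f:
    f.write("python tutorial\nsecond line\n")

with open(path, "r") as f:
    first_six = f.read(6)        # the first 6 characters
    rest_of_line = f.readline()  # continues from the current position to the end of the line
    remaining = f.read()         # everything that is left

print(repr(first_six))     # 'python'
print(repr(rest_of_line))  # ' tutorial\n'
print(repr(remaining))     # 'second line\n'
```

Each read starts where the previous one stopped, which is why a second f.read() right after a full f.read() returns an empty string.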
Append mode adds new content to the end of a file instead of overwriting it. To append to an existing file, simply open the file in append mode ("a").
Syntax:
h = open("Hello.txt", "a")
h.write("Hello World again")
h.close()
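A small sketch of the difference between "w" and "a", assuming a throwaway file in a temporary folder: the first open creates the file, the second one adds a line to the end without touching the first line.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "Hello.txt")

with open(path, "w") as h:          # "w": start with a fresh file
    h.write("Hello World\n")

with open(path, "a") as h:          # "a": keep the old content, add to the end
    h.write("Hello World again\n")

with open(path, "r") as h:
    lines = h.read().splitlines()

print(lines)  # ['Hello World', 'Hello World again']
```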
The CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. Like an xlsx file, it is used for storing your data, and it is easy to read and write. You don't need to run pip3 install csv, because the csv module is part of the Python standard library. Just import csv before using it.
The csv module has two basic functions, reader and writer, which return objects that read and write CSV data.
Return a reader object which will iterate over lines in the given CSV file. Each row read from the CSV file is returned as a list of strings. No automatic data type conversion is performed.
Suppose we have a CSV file, name_list.csv (you can download it here). The content is as follows:
name | id | gender | location | phone |
---|---|---|---|---|
Chico | 1742 | M | KLN | 3344 |
Ri | 1743 | F | LOS | 5168 |
Ivy | 1655 | F | MK | 7323 |
Example 1: How to read this CSV?
import csv
with open('chapter4-example-name_list.csv', 'r') as f:  # open the CSV file
    rows = csv.reader(f)  # create a reader object
    for row in rows:  # loop over every row
        print(row)
Note: If you download the CSV file, copy it into the folder that contains your venv folder. Usually, this is your user folder. All the files you write and read should be in this folder.
Output:
['name', 'id', 'gender', 'location', 'phone']
['Chico', '1742', 'M', 'KLN', '3344']
['Ri', '1743', 'F', 'LOS', '5168']
['Ivy', '1655', 'F', 'MK', '7323']
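Because every cell comes back as a string, numbers have to be converted by hand. The sketch below shows one possible way to do it. It uses io.StringIO as an in-memory stand-in for the file (an assumption made so the example runs without the download); with a real file you would pass the object returned by open() instead.

```python
import csv
import io

# Same content as the table above, kept in memory instead of in a file.
csv_text = """name,id,gender,location,phone
Chico,1742,M,KLN,3344
Ri,1743,F,LOS,5168
Ivy,1655,F,MK,7323
"""

rows = csv.reader(io.StringIO(csv_text))
header = next(rows)           # the first row holds the column names

ids = []
for row in rows:
    ids.append(int(row[1]))   # row[1] is the string '1742'; int() converts it to a number

print(header)  # ['name', 'id', 'gender', 'location', 'phone']
print(ids)     # [1742, 1743, 1655]
```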
Note:
- with open(...) as f opens the file and assigns the file object to the name f. You can replace f with any name you like. It is similar to f = open('chapter4-example-name_list.csv', mode='r'), except that the with block also closes the file automatically when it ends.
- open() opens the file. In write mode ('w'), if there is no such file, it will create a new one; if the file already exists, all the previous content is cleared before the new content is written. In read mode ('r'), the file must already exist, otherwise Python raises FileNotFoundError.
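The behaviour of open() in different modes can be checked with a short experiment. This is only a sketch: the file name no_such_file.csv and the temporary folder are made up for the demo.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "no_such_file.csv")

# Read mode does not create files: a missing file raises an error.
try:
    open(path, "r")
    read_result = "opened"
except FileNotFoundError:
    read_result = "file not found"

# Write mode creates the file if it does not exist yet.
with open(path, "w") as f:
    f.write("name,id\n")

created = os.path.exists(path)
closed_after_with = f.closed   # the with block has already closed the file

print(read_result)        # file not found
print(created)            # True
print(closed_after_with)  # True
```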
Return a writer object responsible for converting the user's data into delimited strings. The file passed to csv.writer() can be any object with a write() method.
Example 2: How to write a CSV file?
import csv
with open('name.csv', 'w', newline='') as f:  # newline='' avoids blank lines between rows on Windows
    mywriter = csv.writer(f)  # csv.writer() returns a writer object for the file
    mywriter.writerow(['Chico', 'Male'])  # writerow() writes one list as one row
    mywriter.writerow(['Ri', 'Female'])  # write another row
Note:
- w means write. If you want to read the file, use r instead, which stands for "read".
- csv.writer() creates a writer object that writes to the name.csv file.
- writerow() writes one row at a time. The input should be a list.
- writerows() takes a list of rows and writes them one after another.
- The argument of writerow() should be a list, because the csv module treats a row as a list of cells. Therefore, wrap the values in [].
Method 1:
import csv
mylist=[['KLN',1742,3344],['Los',1743,5168]]
with open('location.csv', 'w') as f:
    mywriter = csv.writer(f)
    mywriter.writerows(mylist)  # each sub-list in the list becomes one row
Method 2:
import csv
number_list = [11,22,33,44,55,66]
with open('number.csv', 'w') as f:
    mywriter = csv.writer(f)
    mywriter.writerows([[number] for number in number_list])  # equivalent to calling writerow([number]) in a for loop
Output: number.csv contains one number per row, from 11 to 66.
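Method 3:
import csv
student_list = ['Chico','Ri','Ken','Aaliyah','Voodoo']
with open('student.csv', 'w') as f:
    writer = csv.writer(f)
    for i in range(0, len(student_list)):
        writer.writerow(student_list[i])  # passes a string, not a list
Output: each name is split into single characters, one per cell. The first row is C,h,i,c,o.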
Note: You have probably spotted what is wrong here. writerow() writes every item of its argument into a separate cell. As we said, the csv module treats a row as a list, so it regards the string 'Chico' as a sequence of 5 one-character items and puts them into 5 cells. To avoid splitting the characters of a string into different cells, wrap the string in a list: writer.writerow([student_list[i]]).
Try changing writer=csv.writer(f) to writer=csv.writer(f,delimiter=' ') and see what happens.
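The problem in Method 3 and one possible fix can be compared side by side. The sketch below writes into io.StringIO buffers, which behave like files but live in memory (an assumption made so that the result can be printed directly).

```python
import csv
import io

student_list = ['Chico', 'Ri', 'Ken', 'Aaliyah', 'Voodoo']

wrong = io.StringIO()
csv.writer(wrong).writerow(student_list[0])   # a bare string is split into characters

right = io.StringIO()
writer = csv.writer(right)
for student in student_list:
    writer.writerow([student])                # a one-item list stays in one cell

print(wrong.getvalue().strip())       # C,h,i,c,o
print(right.getvalue().splitlines())  # ['Chico', 'Ri', 'Ken', 'Aaliyah', 'Voodoo']
```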
- Write row ['hello','python'] in A1 in the CSV file
import csv
with open('hello.csv', 'w') as f:
    mywriter = csv.writer(f)
    mywriter.writerow(['hello', 'python'])  # writerow('hello', 'python') won't work; put the values in a list. Every item of the list is written into one cell of the table.
- Use writerows to write [['spam','1'],['22','333'],['OK','Good']] in the CSV file
import csv
with open('test.csv', 'w') as f:
    mywriter = csv.writer(f)
    mywriter.writerows([['spam','1'],['22','333'],['OK','Good']])
There are 3 items (3 lists) in the outer list, so the output has 3 rows. Each item inside a row is written into its own cell.
JSON (JavaScript Object Notation) is a lightweight data interchange format inspired by JavaScript object literal syntax.
- JSON is a syntax for storing and exchanging data.
- JSON is text, written with JavaScript object notation.
What does JSON look like?
Example 3:
{
"firstName": "Jane",
"lastName": "Doe",
"hobbies": ["running", "sky diving", "singing"],
"age": 35,
"children": [
{
"firstName": "Alice",
"age": 6
},
{
"firstName": "Bob",
"age": 8
}
]
}
Why use JSON?
The advantages of JSON:
- The data size is small. Compared with XML (another file type used to store and exchange data), JSON is usually smaller in size.
- The transmission speed is fast. Because JSON documents are usually smaller than the equivalent XML, they are typically quicker to transfer and parse.
- The data format is simple, compact, and easy to read and write.
- It is easy to use with Python, because JSON is a key-value (k-v) structure.
In simple terms, a JSON object looks like a Python dict: it has keys, and each key corresponds to a value. A key and its value are separated by :, the whole object is surrounded by {}, and different key-value pairs are separated by ,. Note that JSON strings must use double quotes. For example:
{"key1": "value1", "key2": "value2", "key3": "value3"}
If a key corresponds to multiple values, use [] to enclose all the corresponding values.
{"key1": ["v11", "v12", "v13"], "key2": "v22"}
Now go back and have a look at Example 3 to test whether you have understood the JSON structure.
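One way to check your understanding is to load Example 3 into Python and walk through it with [] and a for-loop. The variable names person_json, person and child_names below are chosen for this sketch only.

```python
import json

# The JSON text from Example 3, stored as a Python string.
person_json = '''
{
    "firstName": "Jane",
    "lastName": "Doe",
    "hobbies": ["running", "sky diving", "singing"],
    "age": 35,
    "children": [
        {"firstName": "Alice", "age": 6},
        {"firstName": "Bob", "age": 8}
    ]
}
'''

person = json.loads(person_json)     # {} becomes a dict, [] becomes a list

first_hobby = person["hobbies"][0]   # key first, then position in the list
child_names = []
for child in person["children"]:     # each child is itself a dict
    child_names.append(child["firstName"])

print(first_hobby)   # running
print(child_names)   # ['Alice', 'Bob']
```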
Syntax
json.dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
Don't panic; we do not have to use all of those parameters. Basically, you need to know these two:
- If sort_keys is true (default: False), then the output of dictionaries will be sorted by key.
- If ensure_ascii is true (the default), all non-ASCII characters in the output, such as Chinese characters, are escaped. If ensure_ascii is false, these characters are output as-is.
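The effect of these two parameters is easiest to see on a small dict. The sample data below is made up for this sketch.

```python
import json

data = {"name": "中文", "id": 1}

escaped = json.dumps(data)                         # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)    # keep the Chinese characters as-is
sorted_keys = json.dumps(data, sort_keys=True)     # "id" now comes before "name"

print(escaped)      # {"name": "\u4e2d\u6587", "id": 1}
print(readable)     # {"name": "中文", "id": 1}
print(sorted_keys)  # {"id": 1, "name": "\u4e2d\u6587"}
```

If you save the ensure_ascii=False output to a file, it is a good idea to open the file with encoding="utf-8" so the characters are stored correctly.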
Example 4: Convert Python object data into JSON string
import json
data = {
    'name' : 'ACME',
    'shares' : 100,
    'price' : 542.23
}
json_str = json.dumps(data)
json_str
Output:
'{"name": "ACME", "shares": 100, "price": 542.23}'
If you want to save the JSON string to a file so that others can re-use it, you can use open() with the .write() function to achieve that.
with open('json_data.json', "w") as f:
    f.write(json_str)  # the with block closes the file automatically, so f.close() is not needed
TIP: In Jupyter notebook, you can write shell commands after !. cat is a shell command that reads the content of a file and outputs it to the screen. It is a common way to check whether the output (to a file) is as intended. In notes-week-01.md, we learned some useful commands like cd, pwd and ls. Those can all be used here. Also recall how we install new Python modules in a Jupyter notebook: !pip install <package-name>.
!cat json_data.json
{"name": "ACME", "shares": 100, "price": 542.23}
Syntax
json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
Example 5: Parse JSON string to Python internal data structure.
json_str = '''
{
    "dataset": {
        "train": { "type": "mnist", "data_set": "train", "layout_x": "tensor" },
        "test": { "type": "mnist", "data_set": "test", "layout_x": "tensor" }
    }
}
'''
import json
result = json.loads(json_str)
TIP: When you write a large chunk of data in Python, this "multi-line string" (also called a "block string literal" or "verbatim" string) is very useful. It helps you preserve all the formatting, like indentation, in the data. In Python it is called a triple-quoted string; many other languages offer a similar feature under the name HEREDOC.
In the above example, you can .read() the json_str from a file instead of using a triple-quoted string. Try it yourself.
Output: the content (string representation) of result
{'dataset': {'test': {'data_set': 'test',
'layout_x': 'tensor',
'type': 'mnist'},
'train': {'data_set': 'train', 'layout_x': 'tensor', 'type': 'mnist'}}}
Actually, if we want to convert between a JSON file and a Python object, we can directly use json.load and json.dump. The difference between loads and load, or dumps and dump, is that the versions ending in -s work with strings, while the versions without -s work with file objects. Sometimes we need those strings to do other things instead of just writing them into files.
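The relationship between the two pairs of functions can be sketched as follows. io.StringIO is used here as an in-memory stand-in for a real file (an assumption for the demo); an object returned by open() works the same way.

```python
import io
import json

stu = {"age": 20, "score": 88, "name": "Bob"}

text = json.dumps(stu)        # dumps: Python object -> string

buffer = io.StringIO()        # behaves like an open file
json.dump(stu, buffer)        # dump: Python object -> file object

same_text = buffer.getvalue() == text   # both produce the same JSON text

buffer.seek(0)                     # go back to the start before reading
from_file = json.load(buffer)      # load: file object -> Python object
from_string = json.loads(text)     # loads: string -> Python object

print(text)                      # {"age": 20, "score": 88, "name": "Bob"}
print(same_text)                 # True
print(from_file == from_string)  # True
```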
Example 6: Converting between JSON file and Python object
import json
stu = {
    "age": 20,
    "score": 88,
    "name": "Bob"
}
with open('stu.json', 'w') as f:
    json.dump(stu, f)  # convert the Python object to a JSON file
Open the stu.json file and you will see:
{"age": 20, "score": 88, "name": "Bob"}
with open('stu.json', 'r') as f:
    data = json.load(f)  # convert the JSON file to a Python object
data
Output:
{'age': 20, 'score': 88, 'name': 'Bob'}
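Once JSON data is loaded into Python lists and dicts, a for-loop is enough to re-format it, for example into CSV rows. The records and the field list below are made up for this sketch, and io.StringIO again stands in for a real file.

```python
import csv
import io
import json

# Hypothetical JSON data: a list of records, like many APIs return.
json_str = '[{"name": "Chico", "id": 1742, "location": "KLN"}, {"name": "Ri", "id": 1743, "location": "LOS"}]'
records = json.loads(json_str)

fields = ["name", "id", "location"]   # the columns we want, in order

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(fields)               # header row
for record in records:                # one dict per record
    writer.writerow([record[field] for field in fields])

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)  # ['name,id,location', 'Chico,1742,KLN', 'Ri,1743,LOS']
```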
Input/Output is often abbreviated as I/O, or simply IO. This is a key topic in computer system design. Before this chapter, we mainly worked with the "internal brain" of a program. To make a computer program more useful, it needs to gather input from the external world and share the results with people after calculation. In general, there are several ways for a computer program to gather input:
- Define variables/constants. It is good practice to list your variables/constants at the beginning of the program. When you pass your script to a user, she can modify the variables/constants to compute her own problem. This method is widely used in Chapters 1-3. The drawback is that it requires a certain IT literacy from end users and is thus less friendly.
- stdin/stdout/stderr. This is the classical way to interact with the computer terminal. Try input() (raw_input() in Python 2) or sys.stdin.read().
- File. Files are a major I/O method. When people organise large data analysis projects, upstream and downstream commonly use data files to exchange information. Refer to the section on file operation in Chapter 5 for more details.
- HTTP request.
  - Chapter 6 is about using APIs. APIs come in many forms. The current mainstream form is the HTTP-based RESTful API. Without bothering with the jargon, you can think of an API as a place for Q&A: you send a question to the server and the server replies with the answer.
  - Chapter 7/8 is about web scrapers. You will make HTTP requests to get external data. Since HTML is semi-structured, one needs to parse the page into a structured form, e.g. Python's list/dict representation.
- Database. A data project usually starts with data files and evolves to a database as the size grows. A database provides indexing capability that allows one to easily retrieve a subset. Another advantage is in-database computation, offered by nearly all DB solutions, which saves network traffic and computing resources on the client side. The topic goes beyond this open book.
- Graphical User Interface (GUI). This gives your end user an experience like other desktop software. Interacting with the display can be complex. We usually refrain from coding from scratch and use frameworks like qt, kivy, etc. The topic goes beyond this open book.
- Command Line Interface (CLI). In Chapter 1, we used several shell commands, in the form of command {arg1} {arg2} .... Command line arguments are a mechanism to give input to a program. One can learn how to design beautiful CLIs from UNIX systems. argparse is a classical package to help design CLIs. docopt provides an innovative way to design a CLI: the developer only needs to write the usage documentation, for example in the module docstring.
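As a rough illustration of the CLI approach, here is a minimal argparse sketch. The script purpose, the filename argument and the --limit option are hypothetical; the list passed to parse_args() simulates what a user would type after the script name.

```python
import argparse

# Describe the inputs the (hypothetical) script accepts.
parser = argparse.ArgumentParser(description="Show the first rows of a CSV file.")
parser.add_argument("filename", help="path to the CSV file")
parser.add_argument("--limit", type=int, default=10, help="number of rows to show")

# Simulates running: python show.py data.csv --limit 5
args = parser.parse_args(["data.csv", "--limit", "5"])

print(args.filename)  # data.csv
print(args.limit)     # 5
```

In a real script you would call parser.parse_args() without arguments, so that the values are taken from the command line.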
- Python official doc about json
If you have any questions or need help troubleshooting, please create an issue here.