Using Bash tools (sed, awk, grep, Vim, ...) to edit multiple files. Case study: the HTML tags <!DOCTYPE>, <html>, <meta charset>, ...
https://unix.stackexchange.com/questions/26284/how-can-i-use-sed-to-replace-a-multi-line-string
Sed cheatsheet
https://gist.github.com/asenchi/2291903
HTML meta tags (recommended way?)
https://www.quackit.com/html_5/tags/html_meta_tag.cfm
Example
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja">
<head>
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
<meta http-equiv="Content-Script-Type" content="text/javascript" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="description" content="content here" />
<title>hoge</title>
</head>
<body>
</body>
</html>
Expected:
<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="EUC-JP">
<title>hoge</title>
<meta name="description" content="content here">
</head>
<body>
</body>
</html>
1. <!DOCTYPE html ...>: my idea is to delete everything on the line after the match "DOCTYPE html" and append the closing ">".
$ sed -i 's/DOCTYPE html.*/DOCTYPE html>/' filename
For filename, I use a simple way to get the list of HTML files that contain DOCTYPE: find . -name "*.html" | xargs grep -li "DOCTYPE"
Then I use column editing (Vim or Sublime Text) to prefix each listed file with the sed ... command.
I think a "one line" solution (pipes, command chaining, etc.) is possible, but working it out seemed to take too much time.
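The list-then-edit steps above can probably be collapsed into one pipeline. This is only a minimal sketch, assuming GNU grep, xargs and sed; the scratch directory and the file names a.html and b.html are made up for the demo.

```shell
# Scratch directory with two sample files (hypothetical names).
demo=$(mktemp -d)
printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' \
  '<html>' > "$demo/a.html"
printf '%s\n' '<html>' '<body></body>' > "$demo/b.html"   # no DOCTYPE: must stay untouched

# grep -rlZ prints the matching file names separated by NUL bytes;
# xargs -0 -r passes them to sed safely (spaces in names, no run on an empty list).
grep -rlZ --include='*.html' 'DOCTYPE html' "$demo" |
  xargs -0 -r sed -i 's/DOCTYPE html.*/DOCTYPE html>/'

first_line=$(head -n 1 "$demo/a.html")
echo "$first_line"   # prints: <!DOCTYPE html>
```

Running it on a copy of the site first and checking the result with diff seems safer than editing the real files in place.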
Special case:
-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
=> +<!DOCTYPE html>
Because the declaration spans two lines, the sed regex has a problem here: sed works on one line at a time, so only the first line is rewritten and the second line is left behind. Fortunately there were only a few such files, so I fixed them manually. As with many automation tasks, there are always exceptions, so be careful to double-check the input and output.
sed, awk and regex end up being used many times. I also mentioned this in a previous post (should be linked here).
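The two-line DOCTYPE in the special case above could also be handled by sed itself. A minimal sketch, assuming GNU sed: when a DOCTYPE line has no closing ">", the N command appends the next line to the pattern space, so one substitution can cover both lines. The temporary file is only for the demo.

```shell
f=$(mktemp)
printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"' \
  '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' \
  '<html lang="ja">' > "$f"

# Address: a DOCTYPE line that ends before its ">" appears.
# N joins the next line; [^>]* then also matches across the embedded newline.
sed -i '/<!DOCTYPE html[^>]*$/{N;s/<!DOCTYPE html[^>]*>/<!DOCTYPE html>/;}' "$f"

doctype_result=$(cat "$f")
echo "$doctype_result"
```

A DOCTYPE that is already on one line does not match the address, so it is left for the single-line command.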
2. <html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja"> =>
<html lang="ja">
In this case, as with the previous line, I use sed to remove the part in front of the match (everything before lang=...). I could not find a good, working answer on the Internet, so I worked around it as follows:
- Append the keyword 'delete-me' (it should be unique) to the first part of the line, before the match; we don't need this part and can remove it later.
- Prepend <html to the second part (in front of the matched text) and keep the tail of the line.
$ sed -i 's/ lang=/delete-me\n<html&/g' filename.html
& here stands for the matched text itself, ' lang='; the tail of the line is not part of the match, so it is kept as-is.
Then run sed again to remove the temporary 'delete-me' lines:
$ sed -i '/delete-me/d' filename.html
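To see what "&" does, and as a possible one-pass alternative to the delete-me trick, here is a small sketch. It assumes the whole <html ...> tag sits on one line, as in the example; the variable names are only for the demo.

```shell
line='<html xmlns="http://www.w3.org/1999/xhtml " xml:lang="ja" lang="ja">'

# "&" is replaced by the matched text itself (here " lang="); the brackets make
# that visible. Everything outside the match is left alone.
marked=$(printf '%s\n' "$line" | sed 's/ lang=/[&]/')
echo "$marked"

# One-pass alternative: the greedy ".*" swallows everything between "<html "
# and the last " lang=", so no temporary line is needed.
html_result=$(printf '%s\n' "$line" | sed 's/<html .* lang=/<html lang=/')
echo "$html_result"   # prints: <html lang="ja">
```

A line that already reads <html lang="ja"> does not match the second pattern, so running it twice should be harmless.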
3. Removing the CSS and JS meta tags is straightforward:
$ sed -i '/meta http-equiv="Content-Style/d' filename
Replace Content-Style with Content-Script to remove the JS meta tag.
Note the 'meta http-equiv' part of the keyword: with "Content-Style" or "Content-Script" alone, the match could get mixed up with other lines, such as <script> tags.
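Both meta tags can likely be removed in a single run. A minimal sketch, assuming GNU sed with -E (extended regex); the <script> line and app.js are made up to show that unrelated lines survive.

```shell
f=$(mktemp)
printf '%s\n' \
  '<meta http-equiv="Content-Script-Type" content="text/javascript" />' \
  '<meta http-equiv="Content-Style-Type" content="text/css" />' \
  '<meta name="description" content="content here" />' \
  '<script type="text/javascript" src="app.js"></script>' > "$f"

# (Style|Script) covers both tags; the "<meta http-equiv" prefix keeps the
# <script> line and the description meta tag out of the match.
sed -i -E '/<meta http-equiv="Content-(Style|Script)-Type"/d' "$f"

meta_result=$(cat "$f")
echo "$meta_result"
```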
4. <meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
or
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> (most common case).
=> <meta charset="EUC-JP">
This case is similar to the <html lang="ja"> conversion above.
$ sed -i 's/; charset=UTF/delete-me&\n<meta charset=\"UTF/g' filename
Because the keyword "charset=" can also appear in <script> or other tags:
<script ... charset="...">, we have to include '; ' in the keyword.
The most common case is content="...; charset=...".
I think I made a small mistake here: UTF is not needed in the keyword. The reason is that I first copied the command from another tutorial, where '&' was placed at the end of the replacement ('&/g'). '&' inserts the matched keyword, so its position decides whether the added text ends up before or after the keyword.
So I will try running it without UTF later. (But I think I am too lazy to try again :)
Anyway, it works.
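For what it is worth, here is a sketch of the run without UTF in the keyword, assuming GNU sed (for -i and for \n in the replacement). The Shift_JIS line is an extra sample that is not in the original files; it is there only to show that other encodings go through as well.

```shell
f=$(mktemp)
printf '%s\n' \
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' \
  '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS" />' > "$f"

# Step 1: split at "; charset=". "&" puts the matched text on the throwaway
# line; the new line starts with <meta charset=" followed by the untouched tail.
sed -i 's/; charset=/delete-me&\n<meta charset="/g' "$f"
# Step 2: drop the throwaway lines.
sed -i '/delete-me/d' "$f"

charset_result=$(cat "$f")
echo "$charset_result"
```

The encoding name is never part of the match, so it ends up on the kept line whatever it is.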
We need one more sed run.
$ sed -i 's/" charset="/delete-me&\n<meta charset="/g' file2edit.html
This time the keyword is '" charset="', for the tags where charset is a separate attribute with no '; ' in front of it. Then run sed again to remove all the temporary lines that contain 'delete-me'.
At first, I thought one sed run would be enough for this case, applied to the keyword '; charset=' or simply ' charset=' if we ignore <script> tags, since they appear only a few times and can be edited manually.
But the real reason the second sed run is required is a case like this:
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
As you can see, there is no content="...; charset=..." here. Focus on the '; ' characters: they are missing, so the first keyword does not match, yet we still have to remove the whole front part of the line before charset. So a second sed run is needed.
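Both charset forms might also be handled in one pass with a capture group instead of the delete-me lines. This is a sketch under assumptions, not what was actually run: it assumes GNU sed -E, a meta tag that fits on one line, and encoding names made only of letters, digits, "_" and "-". The <script> line and app.js are made up.

```shell
f=$(mktemp)
printf '%s\n' \
  '<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />' \
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' \
  '<script type="text/javascript" charset="UTF-8" src="app.js"></script>' > "$f"

# Anchoring on <meta http-equiv="Content-Type" keeps <script charset=...> out.
# "? allows the value with or without an opening quote; \1 is the encoding name.
sed -i -E 's/<meta http-equiv="Content-Type"[^>]*charset="?([A-Za-z0-9_-]+)"?[^>]*>/<meta charset="\1">/' "$f"

both_result=$(cat "$f")
echo "$both_result"
```

As a side effect the trailing " />" becomes ">", which matches the expected output at the top.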
Sed change multiple files
https://stackoverflow.com/questions/10445934/change-multiple-files
Diff two text files and keep only the lines that differ. The idea is to sort both files first and then compare them.
https://stackoverflow.com/questions/10708300/compare-two-files-ignoring-order
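The sort-then-compare idea can be sketched with sort and comm. The two file lists below are made-up sample data.

```shell
dir=$(mktemp -d)
printf '%s\n' 'b.html' 'a.html' 'c.html' > "$dir/before.txt"
printf '%s\n' 'c.html' 'a.html' 'd.html' > "$dir/after.txt"

# comm expects sorted input, so sort both files first.
sort "$dir/before.txt" > "$dir/before.sorted"
sort "$dir/after.txt"  > "$dir/after.sorted"

# -23 keeps lines found only in the first file, -13 only in the second.
only_before=$(comm -23 "$dir/before.sorted" "$dir/after.sorted")
only_after=$(comm -13 "$dir/before.sorted" "$dir/after.sorted")
echo "only in before: $only_before"   # prints: only in before: b.html
echo "only in after: $only_after"     # prints: only in after: d.html
```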